This issue of the Digital Technical Journal presents
papers from three companies — Cray Research, 
Raytheon, and Kubota Graphics — that have devel- 
oped high-performance systems based on the 
Alpha AXP 64-bit microprocessor. Also included 
here are papers about the Alpha AXP chip sets for 
building PCI-based systems and on the compression 
technique used in the DLT2000 tape product. 

Cray Research, the parallel vector processor and 
supercomputing pioneer, has developed its first 
massively parallel processor (MPP) for customers 
who seek the price/performance advantages of an 
MPP design. As Kent Koeninger, Mark Furtney, and 
Martin Walker explain, Cray's MPP uses hundreds 
of fast commercial microprocessors, in this case 
Digital's DECchip 21064, whereas a parallel vector
processor uses dozens of custom (more expensive) 
vector processors. Their paper focuses on the CRAY 
T3D system — an MPP designed to enable a wide 
range of applications to sustain performance levels 
higher than those attained on parallel vector pro- 
cessors. The authors review major system aspects, 
including the programming model, the 3-D torus 
interconnect, and the physically distributed, logi- 
cally shared memory. 

For the U.S. military, Raytheon has designed an 
extended environment, commercial off-the-shelf 
(E²COTS) computer based on the DECchip 21066/68
AXPvme 64 board. Bob Couranz discusses the
characteristics of the E²COTS board that provide the
military with cost and performance advantages. He
describes how designers addressed the military's 
reliability requirements, one of which is computer 
operation in a wide temperature range of -54
degrees C to 85 degrees C. Packaging modifications
made by Raytheon include reconfiguration of the 
module board for conduction cooling as opposed to 
the convection cooling of the commercial product. 



Kubota Graphics' advanced 3D imaging and 
graphics accelerator is used in Digital's DEC 3000
Alpha AXP workstations and in Kubota's workstations.
Ron Levine's paper interweaves a description
of the Kubota accelerator product with a
tutorial on imaging, graphics, and volume render- 
ing. He begins by distinguishing between imaging 
and graphics technologies and their relationship to 
volume rendering methods. He then reviews appli- 
cation areas, such as medical imaging and seismic 
exploration, and expands on volume rendering 
techniques. The final section addresses the Kubota 
implementation, the first desktop-level system to 
provide interactive volume rendering. 

Digital encourages broad industry application 
of the Alpha AXP family of microprocessors. 
Sam Nadkarni, Walker Anderson, Lauren Carlson, 
Dave Kravitz, Mitch Norcross, and Tom Wenners 
describe the chip sets — one cost focused and one 
performance focused — system designers can use to 
easily build PCI-based Alpha AXP 21064 systems.
The authors also present an overview of the EB64+ 
evaluation kit. This companion to the chip sets 
gives designers sample designs and an evaluation 
platform which allows them to quickly evaluate the 
cost and performance implications of their design 
choices. 

The state-of-the-art DLT2000 tape drive offers 
high data throughput, up to 3M bytes/s, and high 
data capacity, up to 30G bytes (compressed). David 
Cressman outlines the product issues that drove the 
DLT2000 development and then details the
developers' investigation of how two different data
compression algorithms, the Lempel-Ziv algorithm and
the Improved Data Recording Capability (IDRC)
algorithm, affect the performance of the tape drive
design. He reviews the tests conducted to measure
compression efficiency and data throughput rates. 
The test results, unexpected by developers, reveal 
that the design using Lempel-Ziv compression gen- 
erally achieves higher storage capacity and data 
throughput rates than the IDRC-based design.
Foreword




Scott A. Gordon 

Manager 

Strategic Programs, 
Semiconductor Operations 



Early in the development of the Alpha program, 
Digital's management put forward a strategic direc- 
tion that would significantly shape the application 
and reach of Alpha AXP technology in the market. 
That direction was to make Alpha AXP technology
"open." In making the technology open, Digital
sought to provide a broader and richer set of
products than the company could provide by itself
and in so doing extend the range of Alpha AXP
technology and the competitiveness of Alpha AXP
products in the market. This represented a signifi- 
cant departure from the operating business model 
of Digital's successful VAX business, where the tech- 
nology was proprietary to Digital. Accordingly, the 
Alpha program required significant changes to
previous business practices. Ongoing interaction with
customers and business partners helped shape and 
clarify these changes. The resulting initiative to 
make the Alpha AXP technology open consisted of 
three primary components: 

1. Digital would sell Alpha AXP technology at all 
levels of integration — chip, module, system. 

2. Digital would provide open licensing of Alpha 
AXP technology. 

3. Digital would work closely with partners to 
extend the range of Alpha AXP technology and
products in the market. 

The first key element in opening the Alpha AXP 
technology was the decision to sell the technology 
at all levels of integration. With access to the tech- 
nology at multiple levels of integration, customers 
and business partners can focus on their own devel- 
opment or application areas of expertise and 



extend Alpha AXP technology to new products or
markets in ways that most effectively meet their 
own business needs. The three papers from Cray 
Research, Raytheon, and Kubota in this issue of the 
Digital Technical Journal are good examples of 
utilizing and extending the range of Alpha AXP
technology from three different levels of integration.

The CRAY T3D massively parallel processor (MPP) 
system utilizes Alpha AXP technology at the chip 
level. Building on the performance leadership of 
the Alpha AXP microprocessor, Cray Research
focused on key areas in the development of a leader- 
ship MPP system — communication and memory 
interconnect, packaging, and the programming 
model and tools. 

Starting with Digital's AXPvme 64 module, 
Raytheon adapted it to meet the extended environ- 
mental and reliability requirements for defense 
application. By starting with an existing module 
design, Raytheon was able to maintain software 
compatibility with commercial Alpha AXP systems,
thus providing a very cost-effective way of deploy- 
ing advanced Alpha AXP computer technology in 
a military environment. 

Lastly, starting from the system level, Kubota 
developed an advanced 3D imaging and graphics 
accelerator for Digital's DEC 3000 AXP workstation
systems. Using the basic system capabilities of the 
workstation, Kubota's 3D imaging and graphics 
accelerator extends the range of the Alpha AXP 
technology to high-performance medical imaging, 
seismic, and computational science applications — 
even to the realm of virtual reality games. 

The decision to sell at all levels of integration 
meant that Digital's Semiconductor Operations
moved from being a captive supplier of micro- 
processor and peripheral support chips exclusively 
for Digital's systems business to being an open 
merchant supplier. Concurrently, it also meant an 
expansion of Digital's OEM business at the module 
and system level. Whereas the business infrastruc- 
ture was already in place for Digital to expand the 
board and systems OEM business, some changes 
were required to meet the needs of external chip 
customers in ways different from those established 
with Digital's internal systems groups. Previously, 
technical support was provided informally, chip 
designer to system designer, while the devel- 
opment tools and supporting peripheral chips 
required for designing-in the microprocessor were 
often developed uniquely by the system group 
itself. Along with the marketing and application 
support resources required to support Digital's
Semiconductor Operations as a merchant supplier,
a full range of hardware and software development 
tools and supporting peripheral chips needed to 
be developed to support the family of Alpha AXP 
microprocessors for external customers. The fourth 
paper in this issue describes part of this "whole 
product" solution developed for the DECchip 21064 
microprocessor — the PCI core logic chip set and an 
evaluation board kit. Together, the chip set and the 
evaluation board kit (which includes OSF/1 or 
Windows NT software tools) provide customers the 
ability to develop Alpha AXP PCI systems with mini- 
mal design and engineering effort. 

A second fundamental element in opening the 
Alpha AXP technology to the broad marketplace
was to openly license the technology. A critical
requirement of both chip customers and potential 
partners was that Alpha AXP microprocessors be 
available from a second source to (1) ensure their 
security of supply and (2) extend the range of chip 
implementations to broaden the markets served 
by the Alpha AXP technology. This is the basis for



the Alpha AXP semiconductor partnership with 
Mitsubishi Electric Corporation announced in 
March 1993. Mitsubishi plans to begin supplying 
Alpha AXP microprocessors based on 0.5-micron 
technology to the open market by the end of 1994. 
In addition to licensing the chip and architec- 
ture, Digital also licenses other elements of the 
Alpha AXP technology to meet the needs of our cus- 
tomers and partners, including Digital's OSF/1 UNIX 
operating system. 

With access at all levels of integration and 
through open licensing, Digital sought and estab- 
lished multiple partner and customer relationships 
to extend the range of Alpha AXP technology and 
products in the market. From portable computing 
to supercomputing, from embedded applications to 
complete system solutions, over seventy-five com- 
panies are currently using Alpha AXP technology in 
their products. This issue of the Digital Technical 
Journal provides a sampling of the ever-broadening 
set of Alpha AXP products and applications enabled 
through open access to the technology. 
R. Kent Koeninger
Mark Furtney 
Martin Walker 



A Shared Memory MPP from 
Cray Research 

The CRAY T3D system is the first massively parallel processor from Cray Research.
The implementation entailed the design of system software, hardware, languages, 
and tools. A study of representative applications influenced these designs. The 
paper focuses on the programming model, the physically distributed, logically 
shared memory interconnect, and the integration of Digital's DECchip 21064 
Alpha AXP microprocessor in this interconnect. Additional topics include latency- 
hiding and synchronization hardware, libraries, operating system, and tools. 



Today's fastest scientific and engineering com- 
puters, namely supercomputers, fall into two basic 
categories: parallel vector processors (PVPs) and 
massively parallel processors (MPPs). Systems in 
both categories deliver tens to hundreds of billions
of floating-point operations per second (GFLOPS) 
but have memory interconnects that differ signifi- 
cantly. After presenting a brief introduction on 
PVPs to provide a context for MPPs, this paper 
focuses on the design of MPPs from Cray Research. 

PVPs have dominated supercomputing design 
since the commercial success of the CRAY-1 super- 
computer in the 1970s. Modern PVPs, such as the 
CRAY C90 systems from Cray Research, continue to 
provide the highest sustained performance on a 
wide range of codes. As shown in Figure 1, PVPs use 
dozens of powerful custom vector processors on a 
high-bandwidth, low-latency, shared-memory inter- 
connect. The vector processors are on one side 
of the interconnect with hundreds to thousands of 
memories on the other side. The interconnect has 
uniform memory access, i.e., the latency and band- 
width are uniform from all processors to any word 
of memory. 

MPPs implement a memory architecture that is 
radically different from that of PVPs. MPPs can 
deliver peak performance an order of magnitude 
faster than PVP systems but often sustain perfor- 
mance lower than PVPs. A major challenge in MPP 
design is to enable a wide range of applications to 
sustain performance levels higher than on PVPs. 



The work described in this paper was partially supported 
by the Advanced Research Projects Agency under Agreement 
No. MDA972-92-0002 dated January 21, 1992. 



MPPs typically use hundreds to thousands of fast 
commercial microprocessors with the processors 
and memories paired into distributed processing 
elements (PEs). The MPP memory interconnects 
have tended to be slower than the high-end PVP 
memory interconnects. The MPP interconnects 
have nonuniform memory access, i.e., the access 
speed (latency and bandwidth) from a processor to 
its local memory tends to be faster than the access 
speed to remote memories. 

The processing speed and memory bandwidth of 
each microprocessor are substantially lower than 
those of a vector processor. Even so, the sum of the 
speeds of hundreds or thousands of microproces- 
sors can often exceed the aggregate speed of dozens 
of vector processors by an order of magnitude. 
Therefore, a goal for MPP design is to raise the effi- 
ciency of hundreds of microprocessors working in 
parallel to a point where they perform more useful
work than can be performed on the traditional PVPs. 
Improving the microprocessor interconnection 
network will broaden the spectrum of MPP applica- 
tions that have faster times-to-solution than on PVPs. 

A key architectural feature of the CRAY T3D sys- 
tem is the use of physically distributed, logically 
shared memory (distributed-shared memory). The 
memory is physically distributed in that each PE 
contains a processor and a local dynamic random- 
access memory (DRAM); accesses to local memory 
are faster than accesses to remote memories. The 
memory is shared in that any processor can read or 
write any word in any of the remote PEs without the 
assistance or knowledge of the remote processors 
or the operating system. Cray Research provides 
a shell of circuitry around the processor that allows 
Figure 1 Memory Interconnection Architectures



the local processor to issue machine instructions to 
read remote memory locations. Distributed-shared 
memory is a significant advance in balancing the 
ratio between remote and local memory access 
speeds. This balance, in conjunction with new pro- 
gramming methods that exploit this new capability, 
will increase the number of applications that can 
run efficiently on MPPs and simplify the program- 
ming tasks. 
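
The logically shared, physically distributed view
described above can be illustrated with a small
Python sketch that splits a global word index into
a PE number and a local offset. The names
WORDS_PER_PE, locate, and is_local, the memory size,
and the simple contiguous layout are all assumptions
made for illustration; this is not the actual
CRAY T3D address format.

```python
from typing import Tuple

# Hypothetical layout: each PE owns one contiguous range of
# WORDS_PER_PE words within a single global address space.
WORDS_PER_PE = 1 << 21  # assumed local memory size, in words


def locate(global_word: int) -> Tuple[int, int]:
    """Return (pe, local_offset) for a global word index."""
    if global_word < 0:
        raise ValueError("global word index must be non-negative")
    return divmod(global_word, WORDS_PER_PE)


def is_local(global_word: int, my_pe: int) -> bool:
    """True when the word lives in the calling PE's own DRAM."""
    pe, _ = locate(global_word)
    return pe == my_pe
```

Under this assumed layout, a reference is local or
remote depending only on the high-order part of the
index, which is the kind of decision the text
attributes to the circuitry around the processor.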

The CRAY T3D design process followed a top- 
down flow. Initially, a small team of Cray Research 
applications specialists, software engineers, and 
hardware designers worked together to conduct 
a performance analysis of target applications. The 
team extracted key algorithmic performance traits 
and analyzed the performance sensitivity of MPP 
designs to these traits. This activity was accom- 
plished with the invaluable assistance and advice of 
a select set of experienced MPP users, whose insights 
into the needs of high-performance computing pro- 
foundly affected the design. The analysis identified 
key fundamental operations and hardware/software 
features required to execute parallel programs with 



high performance. A series of discussions on 
engineering trade-offs, software reusability issues, 
interconnection design studies and simulations, 
programming model designs, and performance 
considerations led to the final design. 

The resulting system architecture is a distributed 
memory, shared address space, multiple instruction, 
multiple data (MIMD) multiprocessor. Special 
latency-hiding and synchronization hardware facili- 
tates communication and remote memory access 
over a fast, three-dimensional (3-D) torus intercon- 
nection network. The majority of the remote mem- 
ory accesses complete in less than 1 microsecond, 
which is one to two orders of magnitude faster than 
on most other MPPs. 1 2 3 
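
To make the torus topology concrete, the following
Python sketch counts the minimum number of hops
between two nodes of a 3-D torus, where each
dimension wraps around. The function name torus_hops
and the example dimensions are assumptions for
illustration; the sketch does not model the actual
CRAY T3D routing hardware.

```python
def torus_hops(a, b, dims):
    """Minimum hop count between nodes a and b in a torus.

    a, b: coordinate tuples such as (x, y, z).
    dims: number of nodes along each dimension.
    """
    hops = 0
    for pa, pb, n in zip(a, b, dims):
        d = abs(pa - pb) % n
        # Each dimension is a ring, so take the shorter way around.
        hops += min(d, n - d)
    return hops
```

In a hypothetical 8 x 8 x 8 torus, nodes (0, 0, 0)
and (7, 0, 0) are one hop apart because of the
wraparound link, and no two nodes are more than
12 hops apart.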

A fundamental challenge for the CRAY T3D sys- 
tem (and for other MPP systems) is usability. By defi- 
nition, an MPP with high usability would sustain 
higher performance than traditional PVP systems 
for a wide range of codes and would allow the pro- 
grammer to achieve this high performance with a 
reasonable effort. Several elements in the CRAY T3D 
system combine to achieve this goal. 
■ The distributed-shared memory interconnect
allows efficient, random, single-word access 
from any processor to any word of memory. 

■ Cray's distributed memory, Fortran program- 
ming model with implicit remote addressing is 
called CRAFT. It provides a standard, high-level 
interface to this hardware and reduces the effort 
needed to arrive at near-optimum performance 
for many problem domains. 4 

■ The heterogeneous architecture allows problems 
to be distributed between an MPP and its PVP 
host, with the highly parallel portions on the MPP 
and the serial or moderately parallel portions on 
the PVP host. This heterogeneous capability 
greatly increases the range of algorithms that will 
work efficiently. It also enables stepwise MPP program
development, which lets the programmer
move code from the PVP to the MPP in stages. 

■ The CRAY T3D high-speed I/O capabilities pro- 
vide a close coupling between the MPP and the 
PVP host. These capabilities sustain the thou- 
sands of megabytes per second of disk, tape, and 
network I/O that tend to accompany problems 
that run at GFLOPS. 

The remainder of this paper is divided into four 
sections. The first section discusses the results of 
the applications analysis and its critical impact on 
the CRAY T3D design, including a summary of critical
MPP functionality. The second section characterizes
the system software. The software serves
multiple purposes; it presents the MPP functional- 
ity to the programmer, maps the applications to the 
hardware, and serves as the interface to the scien- 
tist. In the third section, the hardware design is laid 
out in some detail, including microprocessor selec- 
tion and the design issues for the Cray shell cir- 
cuitry that surrounds the core microprocessor and 
implements the memory system, the interconnec- 
tion network, and the synchronization capabilities. 
The fourth section presents benchmark results. 
A brief summary and references conclude the paper. 

The Impact of Applications on Design 

As computing power increases, computer simula- 
tions increasingly use complex and irregular geom- 
etries. These simulations can involve multiple 
materials with differing properties. A common trend 
is to improve verisimilitude, i.e., the semblance of 
reality, through increasingly accurate mathematical 
descriptions of natural laws. 



Consequently, the resolution of models is 
improving. The use of smaller grid sizes and shorter 
time scales resolves detail. Models that use irregular 
and unstructured grids to accommodate geome- 
tries may be dynamically adapted by the computer 
programs as the simulation evolves. The algorithms 
increasingly use implicit time stepping. 

A naive single instruction, multiple data (SIMD) 
processor design cannot efficiently deal with the 
simulation trends and resulting model characteris- 
tics. Performing the same operation at each point 
of space in lockstep can be extremely wasteful. 
Dynamic methods are necessary to concentrate the 
computation where variables are changing rapidly 
and to minimize the computational complexity. 
The most general form of parallelism, MIMD, is 
needed. In a MIMD processor, multiple independent 
streams of instructions act on multiple indepen- 
dent data. 

With these characteristics and trends in mind, 
the design team chose the kernels of a collection of 
applications to represent target applications for the 
CRAY T3D system. The algorithms and computa- 
tional methods incorporated in these kernels were 
intended to span a broad set of applications, includ- 
ing applications that had not demonstrated good 
performance on existing MPPs. These kernels 
included seismic convolution, a partial multigrid 
method, matrix multiplication, transposition of 
multidimensional arrays, the free Lagrange method, 
an explicit two-dimensional Laplace solver, a conju- 
gate gradient algorithm, and an integer sort. The 
design team exploited the parallelism intrinsic to 
these kernels by coding them in a variety of ways to 
reflect different demands on the underlying hard- 
ware and software. For example, the team gener- 
ated different memory reference patterns ranging 
from local to nearest neighbor to global, with regu- 
lar and irregular patterns, including hot spots. (Hot 
spots can occur when many processors attempt to 
reference a particular DRAM page simultaneously.) 

To explore design trade-offs and to evaluate practical
alternatives, the team ran different parallel
implementations of the chosen kernels on a parameterized
system-level simulator. The parameters character- 
ized machine size, the nature of the processors, 
the memory system, messages and communication 
channels, and the communications network itself. 
The simulator measured rates and durations of 
events during execution of the kernel implementa- 
tions. These measurements influenced the choices 
of the hardware and the programming model. 
The results showed a clear relationship between
the scalability of the applications and the speed of 
accessing the remote memories. For these algo- 
rithms to scale to run on hundreds or thousands of 
processors, a high-bandwidth, low-latency
interprocessor interconnect was imperative. This
finding led the designers to choose a distributed-shared
memory, 3-D torus interconnect with very fast 
remote memory access speeds, as mentioned in the 
previous section. 

The study also indicated that a special program- 
ming model would be necessary to avoid remote 
memory accesses when possible and to hide the 
memory latency for the remaining remote accesses. 
This finding led to the design of the CRAFT pro- 
gramming model, which uses hardware in the inter- 
connect to asynchronously fetch and store data 
from and to remote PEs. This model helps program- 
mers to distribute the data among the shared mem- 
ories and to align the work with this distributed 
data. Thus, they can minimize remote references 
and exploit the locality of reference intrinsic to 
many applications. 

The simulations also showed that the granularity 
of parallel work has a significant impact on both 
performance and the ease of programming. Per- 
forming work in parallel necessarily incurs a work- 
distribution overhead that must be amortized by 
the amount of work that gets done by each proces- 
sor. Fine-grained parallelism eases the program- 
ming burden by allowing the programmer to avoid 
gathering the parallel work into large segments. As 
the amount of work per iteration decreases, how- 
ever, the relative overhead of work distribution 
increases, which lowers the efficiency of doing 
the work in parallel. Balancing these constraints 
contributed to the decisions to include a variety of 
fast synchronization mechanisms, such as a separate 
synchronization network to minimize the overhead 
of fine-grained parallelism. 
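
The trade-off between granularity and overhead can
be expressed with a simple ratio. The Python sketch
below is a toy model, not a measurement from the
simulations: it assumes that every chunk of parallel
work pays one fixed distribution overhead, and the
function name and units are hypothetical.

```python
def parallel_efficiency(work_per_chunk: float, overhead: float) -> float:
    """Fraction of time spent on useful work for one chunk.

    work_per_chunk: time spent computing one chunk of work.
    overhead: fixed cost of distributing and synchronizing it.
    """
    if work_per_chunk <= 0 or overhead < 0:
        raise ValueError("work must be positive, overhead non-negative")
    return work_per_chunk / (work_per_chunk + overhead)
```

With an assumed overhead of 10 time units, a chunk of
90 units runs at 90 percent efficiency, whereas a
chunk of 10 units runs at only 50 percent. Reducing
the overhead, as fast synchronization hardware is
intended to do, raises the efficiency of the small
chunk without forcing the programmer to enlarge it.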

Software 

Cray Research met several times a year with a group 
of experienced MPP users, who indicated that soft- 
ware on existing MPPs was unstable and difficult to 
use. The users believed that Cray Research needed 
to provide clear mechanisms for getting to the raw 
power of the underlying hardware while not diverg- 
ing too far from existing programming practices. 
The users wished to port codes from workstations, 
PVPs, and other MPPs. They wanted to minimize
the porting effort while maximizing the resulting 



performance. The group indicated a strong need 
for stability, similar to the stability of existing CRAY 
Y-MP systems. They emphasized the need to pre- 
serve their software investments across generations 
of hardware improvements. 

Reusing Stable Software 

To meet these goals, Cray Research decided to reuse 
its existing supercomputing software where possi- 
ble, to acquire existing tools from other MPPs 
where appropriate, and to write new software 
when needed. The developers designed the operat- 
ing system to reuse Cray's existing UNICOS oper- 
ating system, which is a superset of the standard 
UNIX operating system. The bulk of the operating 
system runs on stable PVP hosts with only microker- 
nels running on the MPP processors. This design 
enabled Cray Research to quickly bring the CRAY 
T3D system to market. The resulting system had a 
minimal number of software changes and retained 
the maximum stability and the rich functionality 
of the existing UNICOS supercomputing operating 
system. The extensive disk, tape, and network I/O 
capabilities of the PVP host provide the hundreds of 
megabytes per second of I/O throughput required 
by the large MPPs. This heterogeneous operating 
system is called UNICOS MAX. 

The support tools (editors, compilers, loaders, 
debuggers, performance analyzers) reside on the 
host and create code for execution on the MPP 
itself. The developers reused the existing Cray 
Fortran 77 (CF77) and Cray Standard C compilers, 
with modified front ends to support the MPP pro- 
gramming models and with new code generators 
to support the DECchip 21064 Alpha AXP micropro- 
cessors. They also reused and extended the heart of 
the compiling systems: the dependency-graph-analysis
and optimization module.

The CRAFT Programming Model

The CRAFT programming model extends the Fortran 
77 and Fortran 90 languages to support existing 
popular MPP programming methods (message pass- 
ing and data parallelism) and to add a new method 
called work sharing. The programmer can combine 
explicit and implicit interprocessor communication 
methods in one program, using techniques appro- 
priate to each algorithm. This support for existing 
MPP and PVP programming paradigms eases the 
task of porting existing MPP and PVP codes. 

The CRAFT language designers chose directives 
such that codes written using the CRAFT model run 
correctly on machines that do not support the
directives. CRAFT-derived codes produce identical
results on sequential machines, which ignore the 
CRAFT directives. Exceptions are hardware limita- 
tions (e.g., differing floating-point formats), non- 
deterministic behavior in the user's program (e.g., 
timing-dependent logic), and the use of MPP- 
specific intrinsic functions (i.e., intrinsics not avail- 
able on the sequential machines). 

A message-passing library and a shared memory 
access library (SMAL) provide interfaces for explicit 
interprocessor communication. The
message-passing library is Parallel Virtual Machine (PVM),
a public domain set of portable message-passing 
primitives developed at the Oak Ridge National 
Laboratory and the University of Tennessee. 5 The 
widely used PVM is currently available on all Cray 
systems. SMAL provides a function call interface to 
the distributed-shared memory hardware. This pro- 
vides a simple interface to the programmer for 
shared memory access to any word of memory in 
the global address space. These two methods pro- 
vide a high degree of control over the communica- 
tion but require a significant programming effort; 
a programmer must code each communication 
explicitly. 

The CRAFT model supports implicit data-parallel 
programming with Fortran 90 array constructs and 
intrinsics. Programmers often prefer this style 
when developing code on SIMD MPPs. 

The CRAFT model provides an additional implicit 
programming method called work sharing. This 
method simplifies the task of distributing the data 
and work across the PEs. Programmers need 
not explicitly state which processors will have 
which specific parts of a distributed data array. 
Similarly, they need not specify which PEs will
perform which parts of the work. Instead, they 
use high-level mechanisms to distribute the data 
and to assist the compiler in aligning the work 
with the data. This technique allows the program- 
mers to maximize the locality of reference with 
minimum effort. 

In work sharing, programmers use the SHARED 
directives to block the data across the distributed 
memories. They distribute work by placing DO 
SHARED directives in front of DO loops or by using 
Fortran 90 array statements. The compiler aligns 
the work with the data and doles out each iteration 
of a loop to the PE where most of the data associ- 
ated with the work resides. Not all data needs to be 
local to the processor. 
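
The idea of blocking data across PEs and running
each loop iteration where its data resides can be
sketched in Python as follows. This is only an
illustration of an owner-computes rule for a
one-dimensional block distribution; the function
names are hypothetical, and the actual CRAFT
directives and compiler alignment are richer than
this sketch.

```python
def block_owner(i: int, n: int, num_pes: int) -> int:
    """PE that owns element i of an n-element array that is
    split into contiguous blocks of ceil(n / num_pes) elements."""
    block = -(-n // num_pes)  # ceiling division
    return i // block


def iterations_for(pe: int, n: int, num_pes: int) -> list:
    """Loop iterations run by `pe` when iteration i is assigned
    to the PE that owns element i (an owner-computes rule)."""
    return [i for i in range(n) if block_owner(i, n, num_pes) == pe]
```

For an assumed array of 10 elements on 4 PEs, the
block size is 3, so PE 0 runs iterations 0 through 2
and PE 3 runs only iteration 9. Every iteration is
assigned to exactly one PE.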



The hardware and the programming model can 
accommodate communication-intensive programs. 
The compiler attempts to prefetch data that resides 
in remote PEs, i.e., it tends to copy remote data to 
local temporaries before the data is needed. By 
prefetching multiple individual words over the fast 
interconnect, the compiler can mask the latency of 
remote memory references. Thus, locality of refer- 
ence, although still important, is less imperative 
than on traditional MPP systems. The ability to fetch 
individual words provides a very fine-grained com- 
munication capability that supports random or 
strided access to remote memories. 
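
The benefit of prefetching can be estimated with a
toy timing model. The Python sketch below compares
fetching remote words one at a time with issuing the
fetches ahead of use so that their latencies overlap.
The function names, the parameter values, and the
model itself are assumptions for illustration; only
the remote latency of about 1 microsecond is taken
from the text.

```python
def blocking_time(n_words: int, latency: int, compute: int) -> int:
    """Each remote word is fetched, then used, one at a time,
    so every word pays the full remote latency."""
    return n_words * (latency + compute)


def prefetched_time(n_words: int, latency: int, compute: int,
                    issue: int) -> int:
    """Fetches are issued back to back ahead of use.

    Approximation: after the first word arrives, another word
    is consumed every max(issue, compute) time units, so only
    one full latency is exposed.
    """
    return latency + n_words * max(issue, compute)
```

With assumed values of 100 words, a latency of 1000
nanoseconds, 100 nanoseconds of computation per word,
and 50 nanoseconds between issued fetches, the
blocking version takes 110000 nanoseconds and the
prefetched version takes 11000 nanoseconds.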

The programming model is built on concepts 
that are also available in Fortran D, Vienna Fortran, 
and the proposed High Performance Fortran (HPF)
language definition. 6-8 (Cray Research participates
in the HPF Forums.) These models are based on
Mehrotra's original Kali language definition and on
some concepts introduced for the ILLIAC IV parallel
computer by Millstein. 9,10

Libraries 

Libraries for MPP systems can be considered to consist
of two parts: (1) the system support libraries for
I/O, memory allocation, stack management, mathe- 
matical functions (e.g., SIN and COS), etc., and (2) the 
scientific libraries for Basic Linear Algebra Subrou- 
tines (BLAS), real and complex fast Fourier trans- 
forms, dense matrix routines, structured sparse 
matrix routines, and convolution routines. Cray 
Research used its current expertise in these areas, 
plus some third-party libraries, to develop high- 
performance MPP libraries with all these capabilities. 

Tools 

A wide variety of support tools is available to aid 
application developers working on the CRAY T3D 
system. Included in the Cray tool set are loaders, 
simulators, an advanced emulation environment, a 
full-featured MPP debugger, and tools that support 
high-level performance tuning. 

Performance Analysis A key software tool is the 
MPP Apprentice, a performance analysis tool based 
in part on ideas developed by Cray Research for 
its ATExpert tool. 11 The MPP Apprentice tool has
expert system capabilities to guide users in evaluat- 
ing their data and work distributions and in sug- 
gesting ways to enhance the overall algorithm, 
application, and program performance. 
The MPP Apprentice processes compiler and run-time
data and provides graphical displays that relate
performance characteristics to a particular
subprogram, code block, and line in the user's original
source code. The user can select a code block and 
obtain many different kinds of detailed information. 
Specific information on the amount of each type
of overhead, such as synchronization constructs
and communication time, lets the user know precisely
how and where time is being spent. The user
can see exactly how many floating-point instruc- 
tions, global memory references, or other types of 
instructions occur in a selected code block. 

Debugging Cray Research supplies the Cray 
TotalView tool, a window-oriented multiprocessor 
symbolic debugger based on the TotalView product 
from Bolt Beranek and Newman Inc. The Cray
TotalView tool is capable of debugging multiple- 
process, multiple-processor programs, as well as 
single-process programs, and provides a large reper- 
toire of features for debugging programs written in 
Fortran, C, or assembly language. 

An important feature of the debugger is its 
window-oriented presentation of information. 
Besides displaying information, the interface allows 
the user to edit information and take other actions, 
such as modifying the values of the variables. 

The debugger offers the following full range of 
functions for controlling processes: 

■ Set and clear breakpoints (at the source or 
machine level) 

■ Set and clear conditional breakpoints and evalu- 
ation points 

■ Start, stop, resume, delete, and restart processes 

■ Attach to existing processes 

■ Examine core files 

■ Single step source lines through a program, 
including stepping across function calls 

Emulator Cray Research has implemented an 
emulator that allows the user to execute MPP pro- 
grams before gaining access to a CRAY T3D system 
by emulating CRAY T3D codes on any CRAY Y-MP sys- 
tem. The emulator supports Fortran programs that 
use the CRAFT model, including message-passing
and data-parallel constructs, and C programs that
use message passing. Because it provides feedback
on data locality, work distribution, program
correctness, and performance comparisons, the
emulator is useful for porting and developing new 
codes for the CRAY T3D system. 

Hardware 

A macro- and microarchitecture design was chosen 
to resolve the conflict of maximizing hardware per- 
formance improvements between generations of 
MPPs while preserving software investments. This 
architecture allows Cray Research to choose the 
fastest microprocessor for each generation of Cray 
MPPs. The macroarchitecture implements the mem- 
ory system and the interconnection network with 
a set of Cray proprietary chips (shell circuitry) 
that supports switching, synchronization, latency- 
hiding, and communication capabilities. The macro- 
architecture will undergo only modest changes over 
a three-generation life cycle of the design. Source 
code compatibility will be maintained. The micro- 
architecture will allow the instruction set to change 
while preserving the macroarchitecture. 

Macroarchitecture 

The CRAY T3D macroarchitecture has characteris- 
tics that are both visible and available to the pro- 
grammer. These characteristics include 

■ Distributed memory 

■ Global address space 

■ Fast barrier synchronization, e.g., forcing all pro- 
cessors to wait at the end of a loop until all other 
processors have reached the end of the loop 

■ Support for dynamic loop distribution, e.g., dis- 
tributing the work in a loop across the proces- 
sors in a manner that minimizes the number of 
remote memory references 

■ Hardware messaging support 

■ Support for fast memory locks 

Memory Organization 

The CRAY T3D system has a distributed-shared 
memory built from DRAM parts. Any PE can directly 
address any other PE's memory, within the con- 
straints imposed by security and partitioning. The 
physical address of a data element in the MPP has 
two parts: a PE number and an offset within the PE, 
as shown in Figure 2. 
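
As a rough illustration of the two-part address described above, the following Python sketch packs a PE number and a local offset into a single integer and splits it apart again. The field width OFFSET_BITS and both function names are assumptions made for illustration; this section does not give the actual bit layout.

```python
# Illustrative only: OFFSET_BITS is an assumed field width, not the T3D layout.
OFFSET_BITS = 26  # 2**26 bytes covers a 64-megabyte local memory


def make_global_address(pe, offset):
    """Combine a PE number and an offset within that PE's memory."""
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset does not fit in the local-memory field")
    return (pe << OFFSET_BITS) | offset


def split_global_address(addr):
    """Recover (PE number, offset) from a combined address."""
    return addr >> OFFSET_BITS, addr & ((1 << OFFSET_BITS) - 1)
```

Splitting an address built from PE 5 and offset 1234 returns the pair (5, 1234), which is the property the two-part scheme relies on.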

CRAY T3D memory is distributed among the 
PEs. Each processor has a favored low-latency, 
high-bandwidth path to its local memory and a
longer-latency, lower-bandwidth path to memory 
associated with other processors (referred to as 
remote or global memory). 

Data Cache The data cache resident on Digital's
DECchip 21064 Alpha AXP microprocessor is a write-through,
direct-mapped, read-allocate cache. CRAY
T3D hardware does not automatically maintain the 
coherence of the data cache relative to remote mem- 
ory. The CRAFT programming model manages this 
coherence and guarantees the integrity of the data. 

Local and Remote Memory Each PE contains 16 
or 64 megabytes of local DRAM with a latency of 13 
to 38 clock cycles (87 to 253 nanoseconds) and 
a bandwidth of up to 320 megabytes per second. 
Remote memory is directly addressable by the pro- 
cessor, with a latency of 1 to 2 microseconds and 
a bandwidth of over 100 megabytes per second (as 
measured in software). All memory is directly 
accessible; no action is required by remote proces- 
sors to formulate responses to remote requests. 
The total size of memory in the CRAY T3D system is 
the number of PEs times the size of each PE's local 
memory. In a typical 1,024-processor system, the
total memory size would be 64 gigabytes. 
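
The figures in this paragraph can be cross-checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names are illustrative, and the 6.67-nanosecond clock period is the value listed in Table 1.

```python
CLOCK_NS = 6.67  # clock period from Table 1, in nanoseconds


def cycles_to_ns(cycles):
    """Convert a latency in clock cycles to nanoseconds."""
    return cycles * CLOCK_NS


def total_memory_gb(num_pes, mb_per_pe=64):
    """Total memory is the PE count times each PE's local memory."""
    return num_pes * mb_per_pe / 1024
```

With these helpers, 13 and 38 clock cycles round to 87 and 253 nanoseconds, and 1,024 PEs with 64 megabytes each give 64 gigabytes, matching the values quoted above.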

3-D Torus Interconnection Network

The CRAY T3D system uses a 3-D torus for the inter- 
connection network. A 3-D torus is a cube with the 
opposing faces connected. Connecting the faces 
provides dual paths (one clockwise and one coun- 
terclockwise) in each of the three dimensions. 
These redundant paths increase the resiliency of 
the system, increase the bandwidth, and shorten 
the average distance through the torus. The three
dimensions keep the distances short; the length of
any one dimension grows as the cube root of the
number of nodes. (See Figure 3.)
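
To make the effect of the wraparound connections concrete, the sketch below compares hop counts on a torus with hop counts on the same grid without the connected faces. It is a minimal model written for this illustration; the function names and the 8 x 8 x 8 example size are assumptions, not part of the system software.

```python
def ring_distance(a, b, size):
    """Hops between positions a and b on a ring of `size` nodes."""
    d = abs(a - b) % size
    return min(d, size - d)  # go whichever way around is shorter


def torus_distance(src, dst, dims):
    """Minimum hop count between two (x, y, z) nodes on a torus."""
    return sum(ring_distance(s, t, n) for s, t, n in zip(src, dst, dims))


def mesh_distance(src, dst):
    """Hop count on the same grid without wraparound links."""
    return sum(abs(s - t) for s, t in zip(src, dst))
```

For opposite corners of an 8 x 8 x 8 grid, the mesh distance is 21 hops while the torus distance is 3, and no pair of nodes on that torus is more than 12 hops apart.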

When evaluated within the constraints of real- 
world packaging limits and wiring capabilities, the 
3-D torus provided the highest global bandwidth 
and lowest global latency of the many interconnec- 
tion networks studied.' 2 s Using three dimensions 
was optimum for systems with hundreds or thou- 
sands of processors. Reducing the system to two 
dimensions would reduce hardware costs but 
would substantially decrease the global bandwidth, 
increase the network congestion, and increase the 
average latency. Adding a fourth dimension would 
add bandwidth and reduce the latency, but not 
enough to justify the increased cost and packaging 
complexity. 

Network Design 

The CRAY T3D network router is implemented 
using emitter-coupled logic (ECL) gate arrays with 
approximately 10,000 gates per chip. The router is 
dimension sliced, which results in a network node 
composed of three switch chips of identical 
design — one each for X-, Y-, and Z-dimension rout- 
ing. The router implements a dimension-order, 
wormhole routing algorithm with four virtual chan- 
nels that avoid potential deadlocks between the 
torus cycle and the request and response cycles. 

Every network node has two PEs. The PEs are 
independent, having separate memories and data 
paths; they share only the bandwidth of the net- 
work and the block transfer engine (described in 
detail later in the paper). A 1,024-PE system would 
therefore have a 512-node network configured as 
a 3-D torus with XYZ dimensions of 8 X 8 X 8. 
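
Dimension-order routing can be sketched in a few lines: the packet corrects its position in one dimension completely before moving to the next. The code below is a simplified model, not the router's logic. The X, then Y, then Z order, the tie-break toward the positive direction, and the function name are assumptions; virtual channels and wormhole flow control are not modeled.

```python
def dimension_order_route(src, dst, dims=(8, 8, 8)):
    """Return the nodes visited when correcting X, then Y, then Z."""
    path = [tuple(src)]
    cur = list(src)
    for axis, size in enumerate(dims):
        forward = (dst[axis] - cur[axis]) % size  # hops in the + direction
        step = 1 if forward <= size - forward else -1
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + step) % size
            path.append(tuple(cur))
    return path
```

Routing from node (0, 0, 0) to node (7, 1, 0) on the 8 x 8 x 8 torus takes the wraparound link in X and then one hop in Y, visiting (7, 0, 0) and (7, 1, 0).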

Figure 3 CRAY T3D System



The network moves data in packets with payload 
sizes of either one or four 64-bit words. Efficient 
transport of single-word payloads is essential for 
sparse or strided access to remote data, whereas 
the 4-word payload minimizes overhead for dense 
data access. 

For increased fault tolerance, the CRAY T3D sys- 
tem also provides spare compute nodes that are 
used if nodes fail. There are two redundant PEs for 
every 128 PEs. A redundant node can be electroni- 
cally switched to replace a failed compute node by 
rewriting the routing tag lookup table. 

Latency of the switch is very low. A packet enter- 
ing a switch chip requires only 1 clock cycle (6.67 
nanoseconds at 150 megahertz [MHz]) to select its 
output path and to exit. The time spent on the phys- 
ical wires is not negligible and must also be 
included in latency calculations. In a CRAY T3D sys- 
tem, all network interconnection wires are either 
1 or 1.5 clock cycles long. Each hop through the 
network requires 1 clock cycle for the switch plus 
1 to 1.5 clock cycles for the physical wire. Turning a 
corner is similar to routing within a dimension. The 
time required is 3 clock cycles: 1 clock cycle inside
the first chip, 1 clock cycle for the connection
between chips, and 1 clock cycle for the second 
chip, after which the packet is on the wires in the 
next dimension. 

The result is an interconnection network with 
low latency. As stated previously in the Memory 
Organization subsection, the latency for a 1,024-PE 
system, including the hardware and software over- 
head, is between 1 and 2 microseconds. 
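
The per-hop figures above suggest a back-of-the-envelope estimate of the time a packet spends in the network hardware. The sketch below assumes that corner turns simply add 3 clock cycles each on top of the per-hop cost; the text does not state how the two combine, so treat that additive model, and the function names, as assumptions.

```python
CLOCK_NS = 6.67  # one clock cycle at 150 MHz, in nanoseconds


def network_cycles(hops, turns=0, wire_cycles=1.5):
    """Estimate clock cycles spent in switches and on wires."""
    per_hop = 1 + wire_cycles          # 1 cycle in the switch plus the wire
    return hops * per_hop + turns * 3  # assumed: each corner adds 3 cycles


def network_ns(hops, turns=0, wire_cycles=1.5):
    """The same estimate expressed in nanoseconds."""
    return network_cycles(hops, turns, wire_cycles) * CLOCK_NS
```

Under these assumptions, a 12-hop path with two corner turns costs between 30 and 36 clock cycles, a few hundred nanoseconds. That is well below the 1 to 2 microseconds quoted for the full remote reference, which also includes hardware and software overhead at both ends.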

Each channel into a switch chip is 16 bits wide 
and runs at 150 MHz, for a raw bandwidth of 300 
megabytes per second. Seven channels enter and 
seven channels exit a network node: one channel to 
and one channel from the compute resource, i.e., 
the pair of local PEs, and six two-way connections 
to the nearest network neighbors in the north, 
south, east, west, up, and down directions. All four- 
teen channels are independent. For example, one 
packet may be traversing a node from east to west 
at the same time another packet is traversing the 
same node from west to east or north to south, etc. 

The bandwidth can be measured in many ways. 
For example, the bandwidth through a node is 4.2 
gigabytes per second (300 megabytes per second 
times 14). A common way to measure system bandwidth
is to bisect the system and measure the
bandwidth between the two resulting partitions.
This bisection bandwidth for a 1,024-PE CRAY T3D
torus network is 76 gigabytes per second. 
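
The bandwidth figures follow from the channel width and clock rate. The sketch below reproduces them; the bisection calculation assumes the 8 x 8 x 8 torus is cut across one dimension, so that the cut crosses two planes of links because of the wraparound connections. That counting method and the function names are assumptions made for illustration.

```python
def channel_mb_per_s(width_bits=16, clock_mhz=150):
    """Raw channel bandwidth: bytes per cycle times millions of cycles."""
    return width_bits / 8 * clock_mhz


def node_gb_per_s(channels=14):
    """Aggregate bandwidth through one node (7 channels in, 7 out)."""
    return channel_mb_per_s() * channels / 1000


def bisection_gb_per_s(dims=(8, 8, 8)):
    """Bandwidth across a cut through the X dimension of the torus."""
    _, y, z = dims
    links_cut = 2 * y * z         # two planes of links, due to wraparound
    channels_cut = 2 * links_cut  # one channel in each direction per link
    return channels_cut * channel_mb_per_s() / 1000
```

These give 300 megabytes per second per channel, 4.2 gigabytes per second through a node, and 76.8 gigabytes per second across the bisection, in line with the quoted figure of 76.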

Microarchitecture — 
The Core Microprocessor 

The CRAY T3D system employs Digital's DECchip
21064 Alpha AXP microprocessor as the core of the 
processing element. Among the criteria for choos- 
ing this reduced instruction set computer (RISC) 
microprocessor were computational performance, 
memory latency and bandwidth, power, schedule, 
vendor track record, cache size, and programmability.
Table 1, the Alpha Architecture Reference
Manual, and the DECchip 21064-AA Microprocessor
Hardware Reference Manual provide details
on the Alpha AXP microprocessor. 12 b 

For use in a shared address space MPP, all com- 
mercially available microprocessors contempora- 
neous with the DECchip 21064 device have three 
major weaknesses in common: 14 

1. Limited address space

2. Little or no latency-hiding capability 

3. Few or no synchronization primitives 



These limitations arise naturally from the desk- 
top workstation and personal computer envi- 
ronments for which microprocessors have been 
optimized. A desktop system has a memory that is 
easily addressed by 32 or fewer bits. Such a system 
possesses a large board-level cache to reduce the 
number of memory references that result in the long 
latencies associated with DRAM. The system usually
is a uniprocessor, which requires little support for
multiple processor synchronization. Cray Research
designed a shell of circuitry around the core
DECchip 21064 Alpha AXP microprocessor in the
CRAY T3D system to extend the microprocessor's
capabilities in the three areas.

Address Extension 

The Alpha AXP microprocessor has a 43-bit virtual
address space that is translated in the on-chip
data translation look-aside buffer (DTB) to a 34-bit
address space that is used to address physical bytes
of DRAM. Thirty-four bits can address up to 16 gigabytes
(2^34 bytes). Since the CRAY T3D system has up
to 128 gigabytes (2^37 bytes) of distributed-shared
memory, at least 37 bits of physical address are
required. In addition, several more address bits are
needed to control caching and to facilitate control
of the memory-mapped mechanisms that implement
the external MPP shell. The CRAY T3D system
uses a 32-entry register set called the DTB Annex to
extend the number of physical address bits beyond
the 34 provided by the microprocessor.

Table 1 CRAY T3D Core Microprocessor Specifications

Characteristic               Specification

Microprocessor               Digital's DECchip 21064 Alpha AXP microprocessor
Clock cycle                  6.67 nanoseconds
Bidirectional data bus       128 bits data, 28 check bits
Data error protection        SECDED
Address bus                  34 bits
Issue rate                   2 instructions/clock cycle
Internal data cache          8K bytes (256 32-byte lines)
Internal instruction cache   8K bytes (256 32-byte lines)
Latency: data cache hit      3 clock cycles
Bandwidth: data cache hit    64 bits/clock cycle
Floating-point unit          IEEE floating-point and floating-point-to-integer
Floating-point registers     32 (64 bits each)
Integer execution unit       Integer arithmetic, shift, logical, compare
Integer registers            32 (64 bits each)
Integrated circuit           CMOS, 14.1 mm x 16.8 mm
Pin count                    431 (229 signal)
Typical power dissipation    ~23 watts

Shell circuitry always checks the virtual PE num- 
ber. If the number matches that of the local PE, the 
shell performs a local memory reference instead of 
a remote reference. 
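
The address-width arithmetic in this subsection can be verified directly. The helper below is an illustrative sketch (its name is not from the paper) that returns the number of address bits needed to reach every byte of a memory of a given size.

```python
GIGABYTE = 2 ** 30


def address_bits(num_bytes):
    """Smallest number of bits that can address `num_bytes` bytes."""
    return (num_bytes - 1).bit_length()
```

For 16 gigabytes the result is 34 bits, the width of the microprocessor's physical address, and for 128 gigabytes it is 37 bits, which is why the shell must supply additional address bits.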

Latency-hiding Mechanisms

As with most other microprocessors, the external 
interface of the DECchip 21064 is not pipelined; 
only one memory reference may be pending at any 
one time. Although merely an annoyance for local 
accesses, this behavior becomes a severe perfor- 
mance restriction for remote accesses, with their 
longer latencies, unless external mechanisms are 
added to extend the processor's memory pipeline. 

The CRAY T3D system provides three mecha- 
nisms for hiding the startup time (latency) of 
remote references: (1) the prefetch queue, (2) the 
remote processor store, and (3) the block transfer 
engine. As shown in Table 2, each mechanism has 
its own strengths. The compilers, communication 
libraries, and operating system choose among 
these mechanisms according to the specific remote 
reference requirements. Typically, the prefetch 
queue and the remote processor store are the most
effective mechanisms for fine-grained communica- 
tion, whereas the block transfer engine is strongest 
for moving large blocks of data. 

The Prefetch Queue The DECchip 21064 instruc- 
tion set includes an operation code FETCH that per- 
mits a compiler to provide a "hint" to the hardware 
of upcoming memory activity. Originally, the FETCH 
instruction was intended to trigger a prefetch to the 
external secondary cache. The CRAY T3D shell hardware
uses FETCH to initiate a single-word remote
memory read that will fill a slot reserved by the 
hardware in an external prefetch queue. 



The prefetch queue is first in, first out (FIFO) 
memory that acts as an external memory pipeline. 
As the processor issues each FETCH instruction, the 
shell hardware reserves a location in the queue for 
the return data and sends a memory read request 
packet to the remote node. When the read data 
returns to the requesting processor, the shell hard- 
ware writes the data into the reserved slot in the 
queue. 

The processor retrieves data from the FIFO queue 
by executing a load instruction from a memory- 
mapped register that represents the head of the 
queue. If the data has not yet returned from the 
remote node, the processor will stall while waiting 
for the queue slot to be filled. 

The data prefetch queue is able to store up to 16 
words, that is, the processor can issue up to 16 FETCH 
instructions before executing any load instruc- 
tions to remove (pop) the data from the head of 
the queue. Repeated load instructions from the 
memory-mapped location that addresses the head 
of the queue will return successive elements in the 
order in which they were fetched. 
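
The behavior of the prefetch queue can be mimicked with a small software model. The class below is a toy sketch: remote memory is stood in for by a Python dictionary, data arrives instantly rather than after a network round trip, and raising an error when more than 16 requests are outstanding is an assumption of the model, not a statement about what the hardware does.

```python
from collections import deque


class PrefetchQueue:
    """Toy model of a 16-entry FIFO filled by FETCH and drained by loads."""

    DEPTH = 16

    def __init__(self, remote_memory):
        self.remote_memory = remote_memory  # dict: address -> value (stand-in)
        self.slots = deque()

    def fetch(self, address):
        """Model a FETCH: reserve a slot and request the remote word."""
        if len(self.slots) >= self.DEPTH:
            raise RuntimeError("queue full: pop before issuing more FETCHes")
        # Real data returns later; the toy model fills the slot immediately.
        self.slots.append(self.remote_memory[address])

    def pop(self):
        """Model a load from the memory-mapped head of the queue."""
        return self.slots.popleft()
```

Issuing sixteen fetches and then popping sixteen times returns the words in the order they were requested, which is the pipelining pattern a compiler would use to overlap remote latency with other work.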

The Remote Processor Store The DECchip 21064 
stores to remote memory do not need to wait for 
a response, so a large number of store operations 
can be outstanding at any time. This is an effective 
communication mechanism when the producer of 
the data knows which PEs will immediately need to 
use the data. 

The Alpha AXP microprocessor has four 4-word 
write buffers on chip that try to accumulate a cache 
line (4 words) of data before performing the actual 
external store. This feature increases the network 
packet payload size and the effective bandwidth. 

The CRAY T3D system increments a counter in the
PE shell circuitry each time the DECchip 21064 microprocessor
issues a remote store and decrements the
counter each time a write operation completes. For
synchronization purposes, the processor can read 
this counter to determine when all of its writes 
have completed. 
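
The outstanding-store counter amounts to a simple piece of bookkeeping, sketched below. The class and method names are hypothetical; in the real system the increment and decrement are performed by the shell circuitry, not by software.

```python
class RemoteStoreCounter:
    """Toy model of the counter of remote stores still in flight."""

    def __init__(self):
        self.outstanding = 0

    def issue_store(self):
        """Called when the processor issues a remote store."""
        self.outstanding += 1

    def complete_store(self):
        """Called when a remote write is reported complete."""
        self.outstanding -= 1

    def all_complete(self):
        """True when every issued store has completed."""
        return self.outstanding == 0
```

A producer would issue its stores, continue with other work, and poll `all_complete()` only at the point where it must know that the data is visible to the consuming PEs.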

The Block Transfer Engine The block transfer 
engine (BLT) is an asynchronous direct memory 
access controller used to redistribute data between 
local and remote memory. To facilitate reorganiza- 
tion of sparse or randomly organized data, the BLT 
includes scatter-gather capabilities in addition to
constant strides. The BLT operates independently 
of the processors at a node, in essence appearing 
as another processor in contention for memory, 
data path, and switch resources. Cray Research has 
a patent pending for a centrifuge unit in the BLT that 
accelerates the address calculations in the CRAFT 
programming model. 

The processor initiates BLT activity by storing 
individual request information (for example, start- 
ing address, length, and stride) in the memory- 
mapped control registers. The overhead associated 
with this setup work is noticeable (tens of micro- 
seconds), which makes the BLT most effective for 
large data block moves. 

Synchronization

The CRAY T3D system provides hardware primi- 
tives that facilitate synchronization at various levels 
of granularity and support both control parallelism 
and data parallelism. Table 3 presents the character- 
istics of these synchronization primitives. 

Barrier The CRAY T3D has specialized barrier
hardware in the form of 16 parallel logical AND 
trees that permit multiple barriers to be pipelined 
and the resource to be partitioned. When all PEs in 
the partition have reached the barrier and have set 
the same bit to a one, the AND function is satisfied 
and the barrier bit in each PE's barrier register is 
cleared by hardware, thus signaling the processors 
to continue. 



The barrier has a second mode, called eureka 
mode, that supports search operations. A eureka is 
simply a logical OR instead of a logical AND and can 
be satisfied by any one processor.
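
The difference between the two modes is the difference between a logical AND and a logical OR over one bit per PE. The functions below are a minimal sketch of that logic only; their names are illustrative, and the hardware trees, pipelining, and partitioning are not modeled.

```python
def barrier_satisfied(bits):
    """AND tree: true only when every PE has set its bit."""
    return all(bits)


def eureka_satisfied(bits):
    """OR tree: true as soon as any one PE has set its bit."""
    return any(bits)
```

With four PEs of which only one has set its bit, the barrier is not yet satisfied but the eureka is, which is what lets a parallel search stop as soon as one processor finds a result.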

The barrier mechanism in the CRAY T3D system is 
quite fast. Even for the largest configuration (i.e., 
2,048 PEs), a barrier propagates in less than 50 clock 
cycles (about 330 nanoseconds), which is roughly 
the latency of a local DRAM read. 

Fetch-and-Increment The CRAY T3D system has
specialized fetch-and-increment hardware as part 
of a shared register set that automatically incre- 
ments the contents each time the register is read. 
Fetch-and-increment hardware is useful for dis- 
tributing control with fine granularity. For exam- 
ple, it can be used as a global array index, shared by 
multiple processors, where each processor incre- 
ments the index to determine which element in an 
array to process next. Each element can be guaran- 
teed to be processed exactly once, with minimal 
control overhead. 
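
The global-index use of fetch-and-increment can be imitated in software. In the sketch below, Python threads stand in for PEs and a lock-protected counter stands in for the hardware register; the class and function names are hypothetical, and the lock is only needed because software has no equivalent of the read-with-increment register.

```python
import threading


class FetchAndIncrement:
    """Software stand-in for a register that increments on every read."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            value = self._value
            self._value += 1
            return value


def process_array(data, num_workers=4):
    """Let workers claim array elements through a shared index."""
    index = FetchAndIncrement()
    claimed = [0] * len(data)  # how many times each element was taken

    def worker():
        while True:
            i = index.read()
            if i >= len(data):
                return
            claimed[i] += 1  # each index is handed out once, so no race

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

Whatever the number of workers, every entry of the returned list is 1: each element is processed exactly once, and the only coordination is the single read of the shared index.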

Messaging A messaging facility in the CRAY T3D 
system enables the passing of packets of data from 
one processor to another without having an 
explicit destination address in the target PE's mem- 
ory. A message is a special cache-line-size write that 
has as its destination a predefined queue area in the 
memory of the receiving PE. The shell circuitry 
manages the queue pointers, providing flow con- 
trol mechanisms to guarantee the correct delivery 
of the messages. The shell circuitry interrupts the 
target processor after a message is stored. 

Atomic Swap Atomic swap registers are provided 
for the exchange of data with a memory location 
that may be remote. The swap is an atomic opera- 
tion, that is, reading the data from the memory 
location and overwriting the data with the swap 
data from the processor is an indivisible operation. 
As with ordinary memory reads, swap latency can 
be hidden using the prefetch queue. 



Table 3 Synchronization Primitives

Primitive               Granularity   Parallelism

Barrier                 Coarse        Control
Fetch-and-increment     Medium        Both
Lightweight messaging   Medium        Both
Atomic swap             Fine          Data



I/O 

System I/O is performed through multiple Cray 
high-speed channels that connect the CRAY T3D 
system to a host CRAY Y-MP system or to standard 
Cray I/O subsystems. These channels provide hun- 
dreds of megabytes per second of throughput to 
the wide array of peripheral devices and networks 
already supported on Cray Research mainframes. 
Cray has demonstrated individual high-speed channels
that can transfer over 100 megabytes per second
in each direction, simultaneously. There are
two high-speed channels for every 128 processors 
in a CRAY T3D system. 

Benchmark Results 

The following benchmarks show results as of May
1994, six months after the release of the CRAY T3D 
product. The results indicate that in this short span 
of time, the CRAY T3D system substantially outperformed
other MPPs.

As shown in Figure 4, a CRAY T3D system with 
256 processors delivered the fastest execution of 
all eight NAS Parallel Benchmarks on any MPP of any 
size. ls (The NAS Parallel Benchmarks are eight codes 
specified by the Numerical Aerodynamic Simula- 
tion [NAS] program at NASA/Ames Research Center. 
NAS chose these codes to represent common types 
of fluid dynamics calculations.) The CRAY T3D sys- 
tem scaled these benchmarks more efficiently than 
all other MPPs, with near linear scaling from 32 to
64, 128, and 256 processors. Other MPPs scaled the
benchmarks poorly. None of these other MPPs 
reported all eight benchmarks scaling to 256 pro- 
cessors, and the scaling reported showed more 
nonlinear scaling than on the CRAY T3D system. 
These benchmark results confirm that the superior 
speed of the CRAY T3D interconnection network is
important when scaling a wide range of algorithms 
to run on hundreds of processors. 

Note that a 256-processor CRAY T3D system was
the fastest MPP running the NAS Parallel Benchmarks.
Even so, the CRAY C916 parallel vector processor
ran six of the eight benchmarks faster than the
CRAY T3D system. The CRAY T3D system (selling for
about $9 million) showed better price/performance
than the CRAY C916 system (selling for about $27 
million). On the other hand, the CRAY C916 system 
showed better absolute performance. When we 
run these codes on a 512-processor CRAY T3D sys- 
tem (later this year), we expect the CRAY T3D to 
outperform the CRAY C916 system on six of the 
eight codes. 




Figure 4 NAS Parallel Benchmarks

KEY: CRAY C916; CRAY T3D 256 PEs; OTHER MPPs (FASTEST REPORTED ON 64 TO 512 PEs)

KERNELS

EP EMBARRASSINGLY PARALLEL (TYPICAL OF MONTE CARLO APPLICATIONS)
FT 3-D FAST FOURIER TRANSFORM PARTIAL DIFFERENTIAL EQUATION (TYPICAL OF "SPECTRAL" CODES)
MG MULTIGRID (SIMPLIFIED MULTIGRID KERNEL SOLVING A 3-D POISSON PARTIAL DIFFERENTIAL EQUATION)
IS INTEGER SORT
CG CONJUGATE GRADIENT (TYPICAL OF A LARGE, SPARSE MATRIX)

APPLICATIONS

BT BLOCK TRIDIAGONAL (TYPICAL OF ARC3D)
SP SCALAR PENTADIAGONAL (TYPICAL OF ARC3D)
LU LOWER-UPPER DIAGONAL (TYPICAL OF NEWER IMPLICIT COMPUTATIONAL FLUID DYNAMICS)

Heterogeneous benchmark results are also
encouraging. We benchmarked a chemistry application,
SUPERMOLECULE, that simulates an imidazole
molecule on a CRAY T3D system with a CRAY
Y-MP host. The application was 98 percent parallel,
with 2 percent of the overall time spent in serial
code (to diagonalize a matrix). We made a baseline
measurement by running the program on 64 CRAY
T3D processors. Quadrupling the number of processors
(256 PEs) showed poor scaling: a speedup
of 1.3 times over the baseline measurement. When
we moved the serial code to a CRAY Y-MP processor
on the host, leaving the parallel code on 256 CRAY
T3D processors, the code ran 3.3 times faster than
the baseline, showing substantially more efficient
scaling. Figure 5 shows SUPERMOLECULE benchmark
performance results on both homogeneous
and heterogeneous systems. Ninety-eight percent
may sound like a high level of parallelism, but after
dividing 98 percent among 256 processors, each
processor ran less than 0.4 percent of the overall
parallel time. The remaining serial code running on
a single PE ran five times longer than the distributed
parallel work, thus dominating the time to solution.
Speeding up the serial code by running it on a
faster vector processor brought the serial time in
line with the distributed-parallel time, improving
the scaling considerably.
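
The reasoning in this paragraph is an instance of Amdahl's law, which the paper does not name. The sketch below applies the idealized form of that law, assuming the 98 percent figure is measured relative to single-processor run time and ignoring communication and other overheads; the function names are illustrative.

```python
def amdahl_time(parallel_fraction, processors):
    """Run time relative to one processor, with no overheads."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction / processors


def speedup(parallel_fraction, base_procs, new_procs):
    """Speedup from moving between two processor counts."""
    return (amdahl_time(parallel_fraction, base_procs)
            / amdahl_time(parallel_fraction, new_procs))


per_pe_share = 0.98 / 256               # parallel work done by each PE
serial_to_parallel = 0.02 / per_pe_share
```

Each of the 256 PEs carries about 0.38 percent of the work, and the 2 percent serial part takes roughly 5.2 times as long as that share, matching the "five times longer" observation. In this idealized model, going from 64 to 256 processors gives a speedup of a little under 1.5, so the measured 1.3 is consistent with a serial bottleneck plus some additional overhead.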

The CRAY T3D system demonstrated faster I/O
throughput than any other MPP. A 256-processor
system sustained over 570 megabytes per second of 
I/O to a disk file system residing on a solid-state 
device on the host. The system sustained over 360 
megabytes per second to physical disks. 

Summary 

This paper describes the design of the CRAY T3D 
system. Designers incorporated applications pro- 
files and customer suggestions into the CRAFT 
programming model. The model permits high- 
performance exploitation of important computa- 
tional algorithms on a massively parallel processing 
system. Cray Research designed the hardware based
on the fundamentals of the programming model. 

As of this writing, a dozen systems have shipped 
to customers, with results that show the system 
design is delivering excellent performance. The 
CRAY T3D system is scaling a wider range of codes
to a larger number of processors and running 
benchmarks faster than other MPPs. The sustained 
I/O rates are also faster than on other MPPs. The sys- 
tem is performing as designed. 


The E 2 COTS System and
Alpha AXP Technology: 
The New Computer Standard 
for Military Use 

The translation of Digital products for military application has been
affected by the DoD's need for lower cost products. Products developed for military
application must retain robust mechanical characteristics; however, each product
may be tailored to meet government specifications such as mean time between failure
and temperature range. Design changes for military use have had a beneficial
second effect. Militarized products may be readily modified to meet a severe industrial
environment, something that previously could only be accomplished with commercial
products in special enclosures. As a result of the close cooperation between Digital
and Raytheon, cost-effective, severe environment products can be provided to the
DoD and the industry.



In 1986, the Raytheon Company and Digital Equip- 
ment Corporation entered into a licensing agree- 
ment to equip Raytheon's militarized computer 
system with the best commercial computer tech- 
nology of the time, Digital's VAX processor. The 
agreement had two major objectives. The first was 
to incorporate VAX computer technology into a 
configuration that complied with the government's 
existing military specifications. The second was to 
make the militarized VAX technology available as 
a strictly commercial effort. The concept was not 
unique. The Rolm Corporation had militarized a 
number of the commercial computers designed 
originally by Data General Corporation, and 
Norden Systems, Inc. had militarized and marketed 
Digital's PDP-11 system and earlier VAX processors. 
Under the Raytheon/Digital agreement, the first 
computer converted to a configuration usable by 
the military was the VAX 6200 system. The VAX 6200 
incorporated very large-scale integration (VLSI) 
device technology. 

Prior to the introduction of VLSI technology, 
the militarization of computers was difficult but 
manageable. The military was a major customer of 
semiconductor vendors, who would commonly 
manufacture parts to meet both commercial and 
military standards. The semiconductors, resistors,
capacitors, switches, and other parts were tested
and certified to be used in military computers, and 
the mechanical and electrical structure was also 
tested to meet extremes of temperature, shock, and 
vibration. It was, and still is, not unusual to 
encounter a requirement for computer operation 
over the temperature range of -54 degrees Celsius
to 70 degrees Celsius with a 30-minute period of 85 
degrees Celsius. 1 In contrast, the commercial units 
operate in a benign office environment of 0 degrees 
Celsius to 50 degrees Celsius. 2 1 

With the evolution of the proprietary VLSI com- 
puter in 1986, the cost of developing a new military 
computer would have strained the government's
ability to fund the development of modern architec- 
tures to support the advances made in the field of 
software. The funding of new custom VLSI devices 
to become the core of military computers required 
that a large market was available, and the military 
sector offered only a small percentage of the total 
market. 

Military specifications require the costly and 
time-consuming testing and documentation that 
have been in place since World War II. With the end 
of the Cold War and the serious decline of the 
Department of Defense (DoD) budget, the military 
began looking for new ways to procure the weapons 
systems using VLSI computers. For many new
procurements, the DoD approach has been to buy
commercial computers for applications in which
the environment is expected to be office-like.
The forward edge of the battle area (FEBA), how- 
ever, is anything but office-like and usually presents 
environmental challenges that are not normally 
those anticipated by designers of commercial 
systems. For example, when one thinks of the 
climate conditions encountered in the Gulf War, 
a vision of blowing sand and dry, hot weather 
comes to mind. In reality, the desert sand is a fine 
caustic dust, and the air over the Persian Gulf 
contains significant moisture. The combination is 
lethal to conventionally designed electronic equip- 
ment, etching away unprotected circuit board runs 
and contacts. 

To address the combined budgetary and perfor- 
mance dilemma, Raytheon developed the extended 
environment, commercial off-the-shelf (E 2 COTS) 
computer. To provide the best microprocessor performance
available in 1990 and for the foreseeable
future, the E 2 COTS computer is powered by Digital's 
commercial Alpha AXP technology and is con- 
structed to meet the extended environmental 
needs of defense projects. In addition, that tech- 
nology is made available to the government via 
weapon system integrators as a non-developmental 
item (NDI) and on a strictly commercial basis. As 
a result, the first of the E 2 COTS line, Digital's DEC
4000 AXP Model 500 workstation, is already flying
as the Raytheon Model 920 on the JSTARS aircraft.

This paper explores some of the changes made in 
the militarization process. It describes the charac- 
teristics of the E 2 COTS computer combined with 
Alpha AXP technology and the versatile micropro- 
cessor (VME) 64 bus. It then discusses the relevance 
of conduction cooling for the militarized module 
and design trade-offs based on space and thermal 
differences. 

Characteristics of an 
E 2 COTS Computer 

There are three major characteristics of an E 2 COTS 
computer with Alpha AXP technology: 

1. It is software identical to the commercial 
equivalent. 

2. The basic commercial design is modified only to 
the extent necessary to meet the extended envi- 
ronmental and reliability requirements of the 
system in which it is employed. 



3. It is tested at the unit level to meet the military 
operational and logistical specifications required 
of the hardware. 

The commercial software (operating system, 
high-level languages, and development environ- 
ments) executed on the commercial product can 
be captured for the E 2 COTS counterpart. Software 
executed on the commercial computer can be exe- 
cuted on the E 2 COTS computer without change at 
the binary level. Further, the system developer can 
use benign environment commercial equipment to 
start developing and testing the military design. 
Finally, standardized code for high-level languages 
such as Ada can be readily transported to subse- 
quent E 2 COTS computers as technology advances. 

VLSI computers must be carefully designed to 
take into consideration even the length of the inter- 
connect etch on the circuit board. A seemingly 
minor change in the characteristics of the etch may 
affect the signal timing, cross talk, or similar param- 
eters, resulting in either unreliability or total fail- 
ure to operate. Thus, any change in the component 
layout to meet the E 2 COTS configuration must be 
undertaken with extreme care and then only when 
required to meet environmental, reliability, or phys- 
ical space requirements. 

Finally, the historical test methodology of design 
validation tests every component used in the 
design. The completed computer is then tested for 
throughput, power consumption, electromagnetic 
compatibility, and durability. For the E 2 COTS sys- 
tem, this expensive and time-consuming test cycle 
is replaced with the review of the commercial com- 
ponents used in the original design. Based on this 
review, some components may be replaced with 
higher quality or specially screened components, 
and environmental and performance verification 
testing of the completed computer follows. It 
should be noted here that testing may be accom- 
plished at the circuit card assembly (CCA) level 
where such CCAs may be separately developed. 
CCAs that are used in conjunction with a standard- 
ized backplane bus such as the VME bus are typi- 
cally developed at this level. 

Development of an E 2 COTS Single 
Module Computer for the VME Bus 

The close cooperation between Raytheon and 
Digital led to an early identification of the DECchip 
21066 and DECchip 21068 processors and VME 64 
bus based on Alpha AXP technology as an excellent 
choice for translation into an E 2 COTS design. Table 1 
compares the technical specifications of Digital's 
and Raytheon's modules. There are, at present, a 
number of manufacturers of NDI single module 
computers that build to a configuration much like 
the E 2 COTS specifications. Most, although not all, 
are based on the Motorola MC68000 series proces- 
sors. Vendors include Motorola Inc., Radstone 
Technology, and DY-4 Corporation. 

The major reasons for choosing Digital's AXPvme 
64 system were the performance and extensive 
software support desired by embedded processor 
users. Further, the computer was being designed 
for the VME 64 backplane bus. The VME bus has 
been selected by numerous military design organi- 
zations to be the backplane bus of choice, provid- 
ing for an open systems architecture. In addition, 
the AXPvme 64 system incorporated the peripheral 
component interconnect (PCI) bus, thereby offering 
flexibility of I/O design with a minimum of component 
overhead. 

Figure 1 shows the functional block diagram of 
Digital's AXPvme 64 single module computer. In 
most applications, computers of this class are used 
to handle the real-time control of a complex 
system. The computer uses the DECchip 21068, 
capable of 40 SPECfp92, as the base processor. It 
provides standard I/O: small computer systems 
interface (SCSI-2), Ethernet, two serial ports, and 
a VME 64 backplane bus interface as well as three 
configurable timers. Further configuration of the 
module has been made possible through provisions 
for two mezzanine modules. The first contains 
dynamic random-access memory (DRAM) for program 
and data storage. The second interfaces to the 
PCI bus and provides the user the option of adding a 
custom interface to the module. 



Table 1 Technical Specifications Comparing the Digital Commercial and the Raytheon E 2 COTS 
Single Module Computers 

                        Digital Commercial              Raytheon E 2 COTS 
                        Module                          Module 

Physical Characteristics 

Single board            Standard Eurocard               Standard Eurocard 
computer                format (6U) 233 mm X            format (6U) 233 mm X 
                        160 mm (9 inch X                160 mm (9 inch X 
                        6.25 inch) (20.3 mm) wide       6.25 inch) (20.33 mm) wide 

PCI mezzanine card      5.9 inch X 2.95 inch            2.5 inch X 5.5 inch 

Software Support 

Operating systems       DEC OSF/1 AXP, VxWorks          DEC OSF/1 AXP, VxWorks 
                        for Alpha AXP                   for Alpha AXP 

Compilers               Ada, Fortran, C/C++             Ada, Fortran, C/C++ 

Power Requirements 

Power supply            With 32 MB and no PCI           With 32 MB and no PCI 
voltage                 options: 7.64 amperes           options: 4.6 amperes 
                        @ 5 VDC and                     @ 5 VDC and 
                        0.6 ampere @ 12 VDC             0.6 ampere @ 12 VDC 

Environmental Specifications 

Operating               0°C to +50°C with               -54°C to +55°C 
temperature             forced air cooling              system ambient 
                        of 200 linear feet per          (70°C sidewall), 
                        minute at ambient               +85°C side rail for 30 minutes 

Storage temperature     -40°C to +66°C                  -62°C to +95°C 

Temperature change      20°C per hour                   1°C per minute 

Relative humidity       10% to 95% (noncondensing)      0% to 100% (condensing) 

Mechanical shock        7.5 G peak (±1 G)               Per MIL-STD-810D 
                        half sine pulse of              Method 516.3 
                        10 ms (±3 ms) 

Acceleration            Not specified                   9 G continuous operation 

Vibration               5-10 Hz 0.02 in                 Sinusoidal 5 Gs 
                        double amplitude,               50-2,000 Hz 
                        10-500 Hz                       random 0.10 g²/Hz 
                        0.1 G peak 

[Block diagram. Labeled blocks: DECchip 21068 microprocessor; optional 256-KB cache; DRAM memory mezzanine module (8-256 MB); ISA interface; flash memory; two serial ports; interval and watchdog timers; real-time clock and nonvolatile RAM; VME 64 interface to backplane (VME 64 bus); optional SCSI-2; DECchip 21040 Ethernet controller; PCI mezzanine module.] 

Note: Shaded functions are on mezzanine modules. 



Figure 1 Block Diagram of Digital's AXPvme 64 Single Board Computer 



Conduction Cooling of the Module 

The design of a commercial VME module must be 
modified to meet the needs of the military: 
Commercial VME modules (as shown in Figure 2) 
use both the front panel and the connector edges of 
the module for interconnect. Military systems pre- 
clude front (top) of module interfacing because one 
or more cables may be required to be moved for ser- 
vicing. This increases maintenance time and the risk 
of interconnect damage by battlefield personnel. 

Standard commercial modules are normally 
cooled by blowing air over the module. In a com- 
mercial installation, the air is drawn from an air- 
conditioned office environment and is therefore 
devoid of excess humidity or damaging chemicals. 
In the military environment, cooling air is expected 
to contain impurities that will have an adverse 
effect on the long-term, worldwide reliability of the 
module. The AXPvme 64 module is convection 
cooled. One technique used to extend the environmental 
range of the E 2 COTS unit is conduction cooling. 
Conduction cooling eliminates the need to 
bring air, and with it potentially damaging contaminants, 
into the computer enclosure. Conformal 
coating, which covers the board and components with a 
moisture-resistant material similar to plastic, further 
ensures no contact between the circuit card 
assembly and contaminants. It also provides protection 
from condensing humidity. For these reasons, 
the E 2 COTS module (shown in Figure 3) is configured 
to be conduction cooled. 

The decomposition of the module assembly in 
Figure 4 shows a number of techniques used to 
reduce the thermal resistance between the individ- 
ual components and the module/side rail interface. 
The first is the design of the circuit card on which 
all components are mounted. Figure 5 shows the 
layer stackup on the circuit board. The power planes, both 5.0 
volt (V) and 3.3 V, and the associated ground planes 
provide a low-impedance power distribution path 
for the various components and allow the transmission 
of heat from the component to the frame and 
sidewalls. Figure 6 shows the thermal path from a 
typical surface to the sidewall/heat exchanger. The 
heat from the component is passed into the copper 
power planes for transmission to the sidewall/heat 
exchanger. Due to the low thermal resistance of 
copper plus the increased thickness of these planes, 
the thermal resistance is significantly reduced. In 
addition, the combined copper and polyimide 

Figure 2 Digital's AXPvme 64 Single Module Computer 




Figure 3 Raytheon Model 910 VME Single Module Computer with Alpha AXP Microprocessor 
Figure 4 Exploded View of the AXPvme 64 Single Module Computer 



layers provide a circuit board with the necessary 
strength to support the components without an 
additional backbone, although one is used for other 
purposes as noted in the next paragraph. 

A second technique is the use of a combination 
thermal and support frame for the memory module 
and PCI adapter. The use of copper-loaded circuit 
cards extends to the PCI and memory modules. The 
thermal path for components mounted on these 



mezzanine modules is from the component 
through the circuit board embedded copper to the 
heat frame. From the heat frame, the thermal path 
is directly to the sidewall/heat exchangers. The 
mezzanine modules are designed to be screwed 
into the heat frame for both minimal thermal resis- 
tance and structural support against the shock, 
vibration, and "g" loading indicated in the technical 
specifications. 

[Layer stackup diagram. Recoverable labels: layers numbered up to 12; signal layers at 65 ohms in 1-oz copper; 3.3-volt and ground planes, with ground planes in 3-oz copper; outer copper of 0.5 oz; dielectric spacings of 5, 6, and 7.5 mil.] 

Figure 5 Printed Circuit Board Layer Stackup 




Figure 6 Thermal Flow to the Heat Exchanger 



Finally, the two most active thermal radiators are 
the Alpha AXP processor and the 5.0-VDC to 3.3-VDC 
regulator. These components have been placed on 
opposite sides of the circuit board and directly adja- 
cent to the wedgelocks to achieve a minimal ther- 
mal path. Because the DECchip 21068 processor is 
mounted cavity down in the ceramic pin grid array 
(PGA), its primary thermal path has been provided 
in the form of a cover plate. 



Instead of cooling air passing over the surface of 
the module, the air is passed through a heat 
exchanger. Normally this is a brazed sidewall 
that provides both the outer structural shell of the 
computer and a duct, which has embedded heat 
fins for improved heat transfer. Individual mod- 
ules are structurally connected to the sidewall/ 
heat exchanger by wedgelocks that provide a strong 
mechanical interface with relatively low thermal resistance. 

The nominal temperature rise in the heat 
exchanger for an air transport rack (ATR) chassis 
and a total thermal load of approximately 300 watts 
(W) is 14 degrees Celsius. Thus, with a nominal 
inlet air temperature of 25 degrees Celsius, the 
wedgelock interface of an E 2 COTS module is at 39 
degrees Celsius. For modules with total thermal dissipation 
of 20 to 25 W, a nominal 7 degrees Celsius 
rise is anticipated between the sidewall and the module, 
yielding a module temperature of 46 degrees 
Celsius. The heavy aluminum cover essentially maintains 
the base module temperature to the microprocessor's 
case. Measurements of the DECchip 21068 
processor on the computer have shown an average 
power of 5.3 W. With a junction-to-case thermal 
resistance (θjc) of 1.1 degrees Celsius 
per watt, the junction temperature is approximately 52 degrees 
Celsius. At the normal high end of the temperature 
range, 70 degrees Celsius inlet air, the chip temper- 
ature will increase to 97 degrees Celsius. It should 
be noted that the examples of temperature rise are 
nominal and must be computed accurately for each 
module type, total chassis dissipation, and the posi- 
tion of the module in the chassis. 
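
The nominal figures in this paragraph form a simple series chain of temperature rises, which can be sketched as follows. This is a minimal illustration using only the nominal values quoted above (14-degree exchanger rise, 7-degree sidewall-to-module rise, 5.3 W, and 1.1 degrees Celsius per watt junction-to-case); the function and parameter names are invented for the example, and a real analysis would recompute each term per module and chassis position.

```python
def junction_temperature(inlet_c, exchanger_rise_c=14.0, module_rise_c=7.0,
                         chip_power_w=5.3, theta_jc_c_per_w=1.1):
    """Add the series temperature rises from inlet air to the chip junction."""
    wedgelock_c = inlet_c + exchanger_rise_c          # heat exchanger rise
    module_c = wedgelock_c + module_rise_c            # sidewall-to-module rise
    junction_c = module_c + chip_power_w * theta_jc_c_per_w  # case-to-junction rise
    return wedgelock_c, module_c, junction_c


for inlet in (25.0, 70.0):
    wedgelock, module, junction = junction_temperature(inlet)
    print(f"inlet {inlet:.0f} C: wedgelock {wedgelock:.0f} C, "
          f"module {module:.0f} C, junction {junction:.1f} C")
```

With these nominal inputs the junction values round to the 52 and 97 degrees Celsius quoted in the text.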

As part of the thermal analysis of the design, a 
thermal map of the base module was developed 
as shown in Figure 7. The figure is an overlay of 
the thermal profile on the mechanical outline 
of the E 2 COTS single module computer. Although 
planning for the dissipation of power from the 
microprocessor and the voltage regulator proved 
successful, the computer-simulated thermal plot 
indicated a high-temperature region at the top center 
of the module. This area corresponds to the location 
of the 256-kilobyte (KB) cache. The junction 
temperature of the cache static RAMs (SRAMs) could 
approach 70 degrees Celsius given an inlet air tem- 
perature of 25 degrees Celsius. 

Although it might be anticipated that the micro- 
processor would be the board hot spot, the higher 
thermal resistance of the printed circuit board 
results in a potentially higher junction temperature 
of the lower dissipating SRAM devices. Operating at 
70 degrees Celsius inlet air temperature, the resultant 

Figure 7 Thermal Map for the Circuit Card 



SRAM junction temperature would be 104 degrees 
Celsius. Although this high junction temperature is 
still acceptable, it is not desirable because it decreases 
product reliability. Thus, an appropriate modification 
in the thermal design will be made to the circuit 
board stackup before release to production. 

Design Trade-offs 

This section discusses design trade-offs for the 
single module computer based on space and ther- 
mal differences. 

Space Trade-offs 

The conduction-cooled module has significantly 
less surface area for mounting components than its 
convection-cooled counterpart. This is due to the 
use of a thermal frame that serves the dual pur- 
poses of conducting heat to the heat exchanger and 
structurally supporting the mezzanine modules 
to meet shock and vibration specifications. In 
addition to the component mounting constraints 
already identified, Digital's mezzanine module 
provides approximately 17 square inches per side 



for mounting components whereas the Raytheon 
conduction-cooled PCI mezzanine module provides 
approximately 13.8 square inches per side. An 
additional constraint was that the module layout, 
including pad dimensions, had to support a range 
of components from commercial to Class B-1 components. 
As a result, it was necessary to reduce the 
area required for components to fit on the board. 

The necessary reduction in component area was 
accomplished by the incorporation of a number of 
functions into a programmable gate array. The functions 
include 

1. Fault logic 

2. Interrupt multiplexer 

3. All control/status registers (CSRs) 

4. All address decoding 

5. Interval timer glue logic 

A second and more difficult selection was mod- 
ule I/O functionality. In Raytheon's planning stages, 
it was determined that each single module 






computer needed a SCSI bus port for interfac- 
ing with a disk. Ethernet support was impor- 
tant, but this interface seemed to be needed on 
every computer module only in the development 
phase of a new project. Since the development of 
a PCI adapter to verify the performance of the 
adapter interface was an obvious requirement, 
an adapter was developed for the single module 
computer that contained two interfaces: SCSI and 
Ethernet. An alternate objective of this adapter 
development was to test the capability of the PCI 
drive circuitry to support two interfaces on a single 
PCI adapter. Although exhaustive signal integrity 
testing has not been accomplished over the temper- 
ature range, the Ethernet portion of the adapter 
was used in initial debug of the module, including 
download of the system console. It has consistently 
performed without problem. 

A final decision was the establishment of package 
lead geometries that could be supported by both 
commercial and military components. In many 
cases, both commercial and military components 
are available that meet the design criteria. In some 
cases, commercial components are supplied from 
one vendor and military components are procured 
from a second vendor. Unique cases required spe- 
cial solutions. The cache SRAMs are available in 
commercial-quality, J-leaded packages, but no mili- 
tary counterpart could be found. To resolve this 
problem, leadless chip carriers were procured from 
the military vendor and J-leads were welded on the 
basic components by a specialty supplier. 

Thermal Trade-offs 

The extremes of temperature over which an E 2 COTS 
module must operate require careful consideration 
of the effects of thermal cycling on the component 
solder joint with the circuit board. Leadless devices 
such as chip carriers, capacitors, and resistors have 
advantages in the manufacture of circuit boards. 
However, leadless devices also require special care in 
the process whereby these components are attached 
to the circuit card to ensure high solder joint reliabil- 
ity during thermal cycling. For example, Figure 8 
shows a crack in the solder joint of a chip capacitor 
that had undergone thermal cycling to determine 
equipment lifetime under the anticipated operating 
environment. Although these failures can be elimi- 
nated by special manufacturing processes for sol- 
dering leadless components, the use of leaded, 
active components has been made a requirement. 
This is consistent with the use of leadless SRAM with 




Figure 8 Failure Crack in the Solder Joint of 
a Chip Capacitor 



welded-on J-leads described in the previous section 
to help ensure reliability and long module life. 

A second aspect of the thermal environment 
range is the use of large PGA devices soldered into a 
circuit board of copper-polyimide. The Alpha AXP 
device has a diagonal dimension of approximately 2.96 inches. 
The expansion of the ceramic PGA between corner 
pins over a temperature range of -54 degrees 
Celsius to +70 degrees Celsius was studied using 
polyimide boards and ceramic PGAs fully inserted 
into the circuit board to the package standoffs. The 
PGAs did not contain semiconductor devices for 
reasons of cost. 

Pin failures occurred at the corner positions of 
the PGA between 10 and 25 cycles. Additional tests 
were conducted with the PGA inserted so that the 
pin tips protruded slightly below the surface of 
the circuit board before soldering. Thus, the PGA 
was actually standing off the active component sur- 
face of the circuit board. In this configuration, the 
PGA withstood repeated thermal cycles because 
the pins had an opportunity to absorb the strain 
caused by the expansion mismatch. A negative ele- 
ment of this strategy is the inability to adequately 
inspect for solder bridging, which may occur in the 
area under the PGA and on the active component 
surface of the circuit board. It was concluded that 
repeated cycling of the module over even a moder- 
ate part of the temperature range would result in 
the deformation and eventual failure of the pins 
in the corners of the properly mounted PGA. 
As an alternative to soldering the chip's PGA to 
the circuit board, a socket composed of individual 
sleeves inserted into each hole was used success- 
fully. This type of socketing provides sufficient con- 
tact flexibility to eliminate pin cracking of the PGA, 
yet provides a reliable contact during shock and 
vibration. With the use of a socket, the question of 
potential "walking out" of the socket by the PGA was 
raised. The primary thermal path for the Alpha AXP 
processor, as shown in Figure 9, provides the addi- 
tional function of securing the device in the socket, 
thus eliminating the "walk out" problem. 

PCI I/O 

As previously noted, the standard PCI mezzanine 
module design for the single module computer has 
19 percent less surface area than that of Digital's 
mezzanine module. In addition, all I/O from the 
PCI adapter must be routed through 50 pins on 
the P2 connector to the backplane to meet the 



criteria for the standard VME 64 bus. Figure 9 is a 
component side mechanical drawing of the single 
module computer. 
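
The 19 percent figure can be checked against the approximate per-side mounting areas quoted earlier in the Space Trade-offs section (about 17 square inches for Digital's mezzanine module and about 13.8 for the conduction-cooled one). The short sketch below only restates that arithmetic; the function name is made up for the example.

```python
def area_reduction_percent(reference_sq_in, reduced_sq_in):
    """Percentage by which the reduced area falls short of the reference area."""
    return 100.0 * (reference_sq_in - reduced_sq_in) / reference_sq_in


reduction = area_reduction_percent(17.0, 13.8)
print(f"mounting area reduction: {reduction:.1f} percent")
```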

In many single module computer applications, 
the interface to analog, video, and fiber optics is 
required to control or sense synchronous signals 
and status data such as temperature and air velocity, 
and to handle video signals (RS-170, RS-343). For this 
reason the PCI mezzanine module has been 
designed to include an impedance-controlled I/O 
interface by way of a third connector mounted 
between P1 and P2. Such an interface was found to 
be superior to routing analog and video signals out 
the P2 connector and made practical the inclusion 
of fiber-optic interfaces directly to the PCI adapter. 

Parts Selection for the 
E 2 COTS Computer 

The characteristics of Raytheon's E 2 COTS com- 
puter are detailed in the equipment performance 

Figure 9 Mechanical Drawing of the Top Surface of the Raytheon Single Module Computer 

specification. The mechanical features that make it 
compatible with military shock and vibration spec- 
ifications are incorporated at the inception of the 
design. Once the mechanical features have been 
designed into the product, the additional cost at 
production is marginal. The primary factor affect- 
ing the cost is the quality of the semiconductor 
devices used for a given application. In previous 
DoD procurements, all parts were required to be procured 
to MIL-STD-883 or MIL-STD-38510, the quality 
standards for all electronic components. Included 
in the requirements were hermetically sealed pack- 
ages, semiconductor fabrication process validation, 
and in many cases extensive parts testing. All of 
these factors escalated cost substantially. 

The E 2 COTS system allows the temperature and 
reliability requirements of a given application to 
determine the quality of semiconductor com- 
ponents utilized. In fact, reliability, much more 
than temperature range, forces the incorporation 
of military specification Class B-l components. 
Clearly, there are some component types used by 
the commercial vendors that are inherently not 
suitable for military application. A prime example is 
that of oscillators in which the frequency drift over 
temperature range in commercial components is 
excessive. In the larger view, specified reliability 
is the determining factor because the DoD relies on 
MIL-HDBK-217F for the calculation of component, 
subsystem, and system reliability. MIL-HDBK-217F is 
the hardware benchmark against which all designs 
are evaluated. 9 Table 2 compares two part types 
that are typical of the single module computer 
design. In both cases the reliability improvement 
achieved in theory by using military-quality parts is 
a factor of five. 

Since many of the passive components (e.g., resis- 
tors and capacitors) are normally procured to mili- 



tary specification, the ratio of calculated reliability 
for a full military-specification-compliant single 
module computer to a commercial single module 
computer is approximately 4.99. For a calculated 
increase in reliability of approximately 5.0, however, 
the full military-compliant module, subsystem, or 
system may cost 10 times that of the commercial sys- 
tem. This is an unacceptable cost-performance 
trade-off in today's defense environment. 
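
The factor-of-five comparison can be reproduced directly from the failure rates listed in Table 2. The sketch below is illustrative only: the dictionary layout and function names are invented for the example, and the MTBF conversion assumes a constant failure rate, which is an assumption added here rather than something stated in the paper.

```python
# Failure rates in failures per million hours, as listed in Table 2
# (25 degrees C, transport aircraft environment).
FAILURES_PER_MILLION_HOURS = {
    "32K x 8 cache SRAM": {"class_b1": 0.137, "commercial": 0.686},
    "VIC-64 VME interface": {"class_b1": 0.613, "commercial": 3.066},
}


def failure_rate_ratio(rates):
    """Commercial failure rate divided by Class B-1 failure rate."""
    return rates["commercial"] / rates["class_b1"]


def mtbf_hours(failures_per_million_hours):
    """Mean time between failures, assuming a constant failure rate."""
    return 1e6 / failures_per_million_hours


for part, rates in FAILURES_PER_MILLION_HOURS.items():
    print(f"{part}: ratio {failure_rate_ratio(rates):.2f}, "
          f"Class B-1 MTBF {mtbf_hours(rates['class_b1']):,.0f} h, "
          f"commercial MTBF {mtbf_hours(rates['commercial']):,.0f} h")
```

Both ratios come out very close to five, so a tenfold cost increase buys roughly a fivefold calculated reliability gain for these parts.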

Using an E 2 COTS computer, parts selection is con- 
ducted to meet the required mean time between 
failures (MTBF) and temperature range. The "mil-spec" 
semiconductor parts cost is reduced to only 
those parts necessary for the application. The 
robust structure of the module is standard, thus 
providing protection against shock, vibration, and 
acceleration. 

Built-in Test 

Digital's built-in test (BIT), boot, and console code 
are used on an almost "as is" basis. The diagnostics 
provided on previous processors, such as Digital's 
VAX 6200 and VAX 6600 systems and the DEC 3000 
Model 500 AXP workstation, have proven to be very 
robust. The exception is the incorporation of a sys- 
tem-level BIT strategy that is built upon the existing 
BIT design. 

The BIT from each system component must be 
capable of being integrated into the overall system 
environment so that system-level test results 
may be easily obtained and the failed component 
rapidly replaced. To meet this requirement, 
Raytheon has extended the access to the BIT infor- 
mation at the system level by making test results 
available on the VME 64 bus. This is accomplished 
by using the VME interprocessor communication 
registers (ICRs) as mailboxes that may be accessed 
by any bus user. Upon initialization, the ICRs are set 



Table 2 A Comparison of Reliability for Commercial-quality and Military-quality Parts 

Reliability is calculated at 25°C for a transport aircraft environment. 

Device Type             Reliability Calculated      Reliability Calculated      Ratio of Calculated 
                        for Class B-1 Parts         for Commercial Parts        Commercial-quality-part 
                                                                                Failure Rate to 
                                                                                Military-quality-part 
                                                                                Failure Rate 

32K X 8 SRAMs           0.137 failures              0.686 failures              5.007 
for the cache           per million hours           per million hours 

VIC-64 VME              0.613 failures              3.066 failures              5.001 
interface               per million hours           per million hours 

to zero. At the end of the BIT, the results are written 
to the ICRs. Basically, there are three possible 
results available in the ICRs after BIT: 

1. The ICRs contain zero, in which case the module 
has failed to execute the complete BIT and is 
therefore FAILED. 

2. The ICRs contain the PASSED message. 

3. The ICRs contain the FAILED message and iden- 
tify the test(s) that were failed. 

Supervisory processors may poll the single module 
computers and determine their status. 
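
The three mailbox outcomes lend themselves to a simple decode-and-poll routine on the supervisory processor. The sketch below is hypothetical throughout: the paper does not give the ICR bit layout, so the PASSED word, the FAILED flag bit, and the one-bit-per-test mask are assumptions made for illustration, and the bus read is simulated with a dictionary rather than a real VME access.

```python
from enum import Enum

# Hypothetical mailbox encoding; the actual ICR layout is not given in the paper.
PASSED_WORD = 0x50415353   # assumed PASSED message (ASCII "PASS")
FAILED_FLAG = 0x80000000   # assumed flag bit marking a FAILED message
MAX_TESTS = 31             # assumed: one low-order bit per failed test


class BitStatus(Enum):
    INCOMPLETE = "FAILED (BIT did not complete)"
    PASSED = "PASSED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


def decode_icr(word):
    """Classify one ICR mailbox word and list the failed test numbers."""
    if word == 0:
        # Still at the initialization value: BIT never wrote a result.
        return BitStatus.INCOMPLETE, []
    if word == PASSED_WORD:
        return BitStatus.PASSED, []
    if word & FAILED_FLAG:
        failed = [bit for bit in range(MAX_TESTS) if word & (1 << bit)]
        return BitStatus.FAILED, failed
    return BitStatus.UNKNOWN, []


def poll_modules(read_icr, slots):
    """Read and decode the mailbox of every module slot."""
    return {slot: decode_icr(read_icr(slot)) for slot in slots}


# Simulated bus reads standing in for VME accesses by a supervisory processor.
simulated = {1: 0, 2: PASSED_WORD, 3: FAILED_FLAG | 0b101}
for slot, (status, failed) in poll_modules(simulated.get, simulated).items():
    print(slot, status.value, failed)
```

Treating an all-zero mailbox as a failure matters because a module that hangs during BIT never overwrites the initialization value.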

Planned Upgrades to the Model 910 

The first deliveries of the Raytheon Model 910 
utilize the 66-megahertz (66-MHz) DECchip 21068 
processor. Since capabilities drive requirements, 
the availability of the DECchip 21066 necessitates 
the addition of a 160-MHz version of the Model 910. 
Key issues in the incorporation of the DECchip 
21066 processor into the single module computer 
structure are the thermal dissipation of the design 
and the limited number of power and ground pins 
as provided under the VME bus specification. 

Power dissipation of 23 W occurs on a system 
powered by the DECchip 21068 and having 32 
megabytes (MB) of memory, a SCSI bus, and Ethernet 
running the DEC OSF/1 AXP operating system and a 
graphics demonstration on an X window terminal. 
When the same unit was exercised with the 
DECchip 21066, the power dissipation increased 
to 40 W, underscoring the need for more power/ 
ground pins and additional thermal paths to the 
sidewall/heat exchanger. The memory capacity will 
also be expanded in 1994 to a maximum of 256 MB 
in increments of 128 MB. 

These design upgrades are being completed 
during 1994. 
Volume Rendering with the 
Kubota 3D Imaging and 
Graphics Accelerator 

The Kubota 3D imaging and graphics accelerator, which provides advanced 
graphics for Digital's DEC 3000 AXP workstations, is the first desktop system to com- 
bine three-dimensional imaging and graphics technologies, and thus to fully 
support volume rendering. The power of the Kubota parallel architecture enables 
interactive volume rendering. The capability for combining volume rendering 
with geometry-based rendering distinguishes the Kubota system from more special- 
ized volume rendering systems and enhances its utility in medical, seismic, and 
computational science applications. To meet the massive storage, processing, and 
bandwidth requirements associated with volume rendering, the Kubota graphics 
architecture features a large off-screen frame buffer memory, the parallel process- 
ing power of up to 20 pixel engines and 6 geometric transform engines, and wide, 
high-bandwidth data paths throughout. 



The Kubota 3D imaging and graphics accelerator, 
which provides advanced graphics for Digital's 
DEC 3000 AXP workstations, enables interactive 
volume rendering— a capability that is unique 
among workstation-class systems. This paper 
begins with a discussion of the relations between 
imaging, graphics, and volume rendering tech- 
niques. Several sections then discuss the nature and 
sources of volume data sets and the techniques of 
volume rendering. The paper concludes with a 
description of how volume rendering is imple- 
mented on the Kubota accelerator. 

This paper is also intended as a tutorial on volume 
rendering for readers who may not yet be familiar 
with its concepts and terminology. Following 
the body of the paper, the Appendix reviews the 
basic ideas and terminology of computer graphics 
and image processing, including digital images and 
geometry-based models, that lead to the ideas of 
volume rendering. Readers may wish to turn to the 
Appendix before proceeding to the next section. 



This paper is a modified version of Volume Rendering with 
Denali, Version 1.0, which was written by Ronald D. Levine and 
published as a white paper by Kubota Pacific Computer Inc., 
June 1993. Copyright 1993, Kubota Pacific Computer Inc. 



Geometry, Pixels, and Voxels 

Historically, computer graphics and image process- 
ing have been distinct technologies, practiced by 
different people for different purposes. Graphics 
has found application in computer-aided design, 
engineering analysis, scientific data visualization, 
commercial film, and video production. Image pro- 
cessing has found application in remote sensing for 
military, geophysical, and space science applica- 
tions; medical image analysis; document storage 
and retrieval systems; and various aspects of digital 
video. 

There is a recent trend for these two approaches 
to computer imaging to converge, and the Kubota 
3D imaging and graphics accelerator is the forerun- 
ner of systems that enable the combination of imag- 
ing and graphics technologies. Users of either of the 
technologies increasingly find uses for the other as 
well. One result of this synergism in the combina- 
tion of computer graphics and image processing is 
the advent of volume visualization and volume ren- 
dering methods, i.e., the production of images 
based on voxels. 

The idea of the voxel-based approach is a natural
generalization of the idea of pixels (digital picture
elements); voxels simply add another dimension.
(See Figure 1 and Figure 2.) But voxel-based methods
have been slow to be adopted because their performance
requirements are massive in terms of processing power,
storage, and bandwidth. These requirements reflect the
general fact that the size of the problem increases in
proportion to the cube of the resolution. A
two-dimensional (2-D) image with n pixels on a side has
n^2 pixels, whereas a three-dimensional (3-D) volume
data set with n voxels on a side has n^3 voxels. If the
typical minimum useful linear resolution in an imaging
application is 100, then a volume data set will have at
least 100 times the data of a digital image.
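As a rough illustration of this scaling, the following Python sketch counts the elements of a square image and of a cubic volume at the same linear resolution. The function names are hypothetical and serve only this example.

```python
def pixel_count(n):
    """Number of pixels in a square 2-D image with n pixels on a side."""
    return n ** 2


def voxel_count(n):
    """Number of voxels in a cubic 3-D volume with n voxels on a side."""
    return n ** 3


for n in (100, 256, 512):
    # The ratio of voxels to pixels equals the linear resolution n.
    ratio = voxel_count(n) // pixel_count(n)
    print(f"n={n}: {pixel_count(n)} pixels, {voxel_count(n)} voxels, ratio {ratio}")
```

At a linear resolution of 100, the volume holds 100 times as many elements as the image, which is the factor quoted in the text.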

The Kubota accelerator is an imaging and com- 
puter graphics system. That is, it contains hardware 
and firmware support both for producing images 
from geometry-based models and for accelerating 
certain fundamental image processing functions. 
As a graphics system, the Kubota accelerator pro- 
duces raster images from 3-D geometric models. It 
offers hardware support for the graphics pipeline, 
including geometry processing, lighting computa- 
tion, shading interpolations, depth buffering, ras- 
terization, and high-quality rendering features such 
as antialiasing, texture mapping, and transparency. 
As an image processing system, the Kubota accelerator
includes hardware support for basic image processing
functions, such as pixel block transfers,
image zooming and rotating, image compositing, 
and filtering. Some of these low-level image pro- 
cessing functions are used in the graphics pipeline 
and for advanced features such as antialiasing and 
texture mapping. 

Volumetric rendering methods share certain features
with both computer graphics and imaging. Common to
volume rendering and graphics are the mathematical
transformations of viewing and projection, as well as
the surface shading methods when the volume rendering
method is isosurface rendering. Common to volume
rendering and image processing are the resampling and
filtering operations, which are even more costly than
in image processing because of the additional
dimension. It is not surprising that a system that
implements both imaging and graphics functionality is
also amenable to volume rendering.

Figure 1 Two-dimensional Pixels

Like 3-D geometry-based graphics techniques, 
volume rendering techniques help the user gain 
understanding of a 3-D world by means of images, 
that is, projections to the 2-D viewing plane. As 
in geometry-based graphics, one of the most effec- 
tive means of helping the viewer understand a 3-D 
arrangement through 2-D images is to provide inter- 
active control over the viewing parameters, i.e., the 
3-D position and orientation of the subject relative 
to the viewer. 

In volume rendering, other visualization parame- 
ters beyond the viewing parameters need inter- 
active control. For example, the exploration of 
volume data is greatly facilitated by interactive con- 
trol of the position and orientation of section 
planes, of isosurface level parameters, and of sam- 
pling frequencies. 

Therefore, volume rendering methods are most 
useful when they are feasible in interactive time. 
When the viewer turns a dial, the rendered image 
should be updated without noticeable delay. The 
upper limit on the response time implied by the 
interactive control requirement and the lower limit 
on problem size implied by the requirements of 
usable resolution combine to set extreme perfor- 
mance thresholds for practical volume rendering 
systems. The Kubota 3D imaging and graphics 
accelerator is able to meet these performance 
demands because it has a large image memory,
powerful parallel processing elements, and high-speed
data paths that connect these components.

Figure 2 Three-dimensional Voxels

Volume Data Sets 

After briefly defining a volume data set, this section 
describes three application areas that are sources of 
volume data sets. 

Volume Data Set Definition 
A volume data set, also called volumetric data, is a
generalization to 3-D space of the concept of a digital
image. A volume data set contains sample values
associated with the points of a regular grid in a 3-D
space. Each element of a volume data set is called
a voxel, analogous to a pixel, which represents an
element of a 2-D digital image. Just as we sometimes
think of pixels as small rectangular areas of an
image, it makes sense to think of voxels as small
volume elements of a 3-D space.

Volume Data Sets in Medical Imaging 
Because the real world has three spatial dimensions, 
virtually any application area that produces ordi- 
nary image data can also be a source of volume data 
sets. For example, in medical computed tomography
(CT) imaging, when the 3-D structure of an
organ is being studied, it is common to take a series
of CT exposures through a set of parallel planar
slices. These data slices represent a sampling of the
underlying tissue (of some predetermined thickness)
and can be stacked up to produce a volume
data set. Magnetic resonance imaging (MRI), ultrasound
imaging, and the positron emission tomography
(PET) and single-photon emission computed
tomography (SPECT) scan methods of nuclear medicine
all can similarly produce volume data sets.
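The stacking of parallel slices into a volume can be sketched in a few lines of Python. This is only an illustration: the nested-list layout volume[z][y][x] and the function names are assumptions made for this example, not a description of any scanner or Kubota data format.

```python
def stack_slices(slices):
    """Stack equally sized 2-D slices (lists of rows) into volume[z][y][x]."""
    if not slices:
        raise ValueError("at least one slice is required")
    rows, cols = len(slices[0]), len(slices[0][0])
    for s in slices:
        if len(s) != rows or any(len(r) != cols for r in s):
            raise ValueError("all slices must have the same dimensions")
    # Copy the rows so that the volume does not alias the input slices.
    return [[list(row) for row in s] for s in slices]


def volume_shape(volume):
    """Return (slices, rows, columns) of a volume[z][y][x]."""
    return len(volume), len(volume[0]), len(volume[0][0])


# Three tiny 2 x 2 slices standing in for successive CT exposures.
slices = [[[z, z + 1], [z + 2, z + 3]] for z in range(3)]
volume = stack_slices(slices)
print(volume_shape(volume))
print(volume[2][1][0])
```

The slice index becomes the third coordinate, so each stored value is addressed as a voxel rather than as a pixel of an independent image.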

Volume Data Sets in Seismic Exploration 
Another important application area that gives rise 
to massive volume data sets is seismic exploration 
for petroleum and mineral resources. In seismic 
exploration, an acoustic energy source (usually a 
dynamite charge or a mechanical vibrator) radiates 
elastic waves into the earth from the surface. 
Receivers on the surface detect acoustic energy 
reflected from geologic interfaces within the earth. 
The data gathered at each receiver is a time series, 
giving the acoustic wave amplitude as a function of 
time. A single pulse from the source yields a time 
series of pulses at each receiver. The amplitude
of a reflected pulse carries information about
the nature of the rock layers meeting at the interface,
and its arrival time is directly related to the
interface's depth in the earth. The receivers are
generally placed on a regular 2-D grid on the surface.
There may be more than a thousand receivers
involved in a single seismic experiment. There may
be hundreds of strata in the vertical direction that
contribute pulses to the reflected signals, so the
time-sampling resolution must also be measured in
the hundreds or more. Thus, the seismic data comprises
a large volume data set in which the value of
each voxel is an acoustic amplitude, and the voxel
array has two horizontal spatial dimensions and
one vertical time dimension. The time dimension is
directly related to depth in the earth. The relationship
is not simple, however, because the acoustic
wave velocity varies substantially from stratum to
stratum. In fact, determination of the depth-time
relationship is one initial objective of the
interpretation activity.

The ultimate objective of the seismic interpreter 
is to locate the regions most likely to contain oil and
gas deposits. These deposits occur only in porous 
permeable rock that is completely surrounded by 
impermeable rock. The strata are approximately 
horizontal, but slight tilting from the horizontal 
(called dip) and strata discontinuities caused by 
faults are extremely important clues for locating the 
regions where petroleum deposits may be trapped. 

Proper interpretation of seismic data is not trans- 
parent; it requires trained specialists. Because the 
data for each seismic study can be so massive, inter- 
preters have always relied heavily on graphical 
methods. For a long time, the conventional graphi- 
cal methods have used long paper strip charts, each 
recording perhaps hundreds of parallel traces. 
Interactive workstations that use 2-D graphics and
imaging methods have begun to be adopted in the
seismic interpretation industry. The application
of volume rendering methods is not yet widespread 
in seismic interpretation, but the advent of the 
Kubota accelerator, with its volume rendering capa- 
bility, should stimulate the development of such 
applications. 

Volume Data Sets in 
Computational Science 

Volume data sets also arise naturally as computer- 
synthesized data in computational science. Three- 
dimensional fields are quantities that are attached 
to all the points in defined regions of 3-D space and 
generally vary from point to point and in time. The
laws of physics that govern the evolution of 3-D
physical fields are most commonly expressed in 
terms of partial differential equations. The simula- 
tions of computational physics, such as computa- 
tional fluid dynamics, stress and thermal studies in 
engineering, quantum physics, and cosmology, all 
study 3-D fields numerically by sampling them on 
finite sets of sample points. In many numerical 
methods, the sample points are arranged in regular 
grids or can be mapped to regular grids; thus, the 
field samples give rise to volume data sets. 

Because the objects of study are continuous, 
there is no limit to the desired resolution or to the 
desired size of the volume data set. In practice, 
the grid resolutions are limited by the computing 
power of the machines used to perform the numer- 
ical solution, typically supercomputers for most 
3-D problems. With today's supercomputers, a sin- 
gle 3-D field simulation may use millions of sample 
points. 

Oil reservoir simulation — the modeling of the 
flow and evolution of the contents of subterranean 
oil deposits as the oil is extracted through wells — is 
another example of computational simulation. It 
also makes use of supercomputers, and it presents 
the same kind of 3-D data visualization problems as 
computational basic science. 

The computational scientist is absolutely depen- 
dent on graphical visualization methods to explore, 
comprehend, and present the results of supercom- 
puter simulations. For three- (and higher) dimen- 
sional work, most supercomputer visualization has 
depended on geometry-based tools, using modeled 
isosurfaces, stream lines, and vector advection 
techniques. Now, with hardware readily available 
that allows interactive volume visualization, the 
use of volume rendering methods as a means of 
exploring the large data sets of computational sci- 
ence will grow. 

Volume Rendering 

Although it is easy to understand the sources and 
significance of volume data sets, it is not as easy to 
determine the best way to use a raster imaging sys- 
tem to help the user visualize the data. After all, the 
real images displayed on the screen and projected 
onto the retinas of the eyes of the viewer are ordi- 
nary 2-D images, made of pixels. These images nec- 
essarily involve the loss of some 3-D information. So 
one objective of volume visualization techniques 
should be to give the user 2-D images that commu- 
nicate as much of the 3-D information as possible. 



As mentioned earlier, interactive control over view- 
ing parameters is an excellent means of conveying 
the 3-D information through 2-D images. Moreover, 
some volume rendering applications require inter- 
active control of other parameters, such as section 
planes, isosurface levels, or sampling frequencies, 
in order to allow the user to explore the volume 
data set. 

Volume rendering refers to any of several tech- 
niques for making 2-D images from the data in vol- 
ume data sets, more or less directly from the voxel 
data, respecting the voxels' spatial relationships in 
all three dimensions. We include in the definition 
certain methods that make use of rendering sur- 
faces determined directly by the voxel data, such as 
interpolated isosurfaces. Foremost, volume render- 
ing methods make images from the volume data that 
depend on the fully 3-D distribution of data and 
that are not necessarily limited to a fixed set of pla- 
nar sections. The idea of volume rendering is to 
make images in which each pixel reflects the values 
of one or more voxels combined in ways that 
respect their arrangement in 3-D space, with arbi- 
trary choice of the viewing direction and with con- 
trol over the sampling function. 

We can think of an ordinary X-ray image as the 
result of an analog volume rendering technique. 
The X-ray image is a result of the X-ray opacity 
throughout the exposed volume of the subject. 
The image density at any point of the image is 
determined by the subject's X-ray opacity inte- 
grated along the X-ray that comes to that image 
point from the source. Thus, the information about 
how the opacity is distributed along the ray is 
lost. This loss of information is in part responsible 
for the difficulty of interpreting X-ray images. The 
radiologist can make use of several different 2-D
exposures of the subject acquired in different direc- 
tions to help understand the 3-D situation. A prede- 
termined sampling of views, however, cannot 
compare with true interactive view adjustment as 
an aid to 3-D comprehension. Moreover, the radiol- 
ogist has no ability to vary the sampling function; it 
will always be the integral of the X-ray opacity 
along the rays. 

Of course, the ordinary X-ray image does not pro- 
vide us with a volume data set, but we can obtain 
volume data sets from CT imaging, as previously 
described. The following volume rendering meth- 
ods enable the user to explore the X-ray opacity 
using arbitrary viewing parameters or different 
sampling functions.

The methods described here, in analogy with the
X-ray as an analog volume renderer, all use families
of parallel rays cast into the volume, usually one ray
for each pixel in the displayed image. In general,
the volume data set is resampled along the rays. The
methods differ in the choice of the function that
determines pixel color according to the sample
values along a ray.

Without volume rendering, the radiologist must 
treat this CT volume data set as a sequence of inde- 
pendent image slices. To get a 3-D picture of the 
X-ray opacity function, the radiologist must mentally 
integrate the separate images, which are presented 
either sequentially or in an array of images on a dis- 
play surface. The particular viewing direction for 
an acquired image sequence may not be optimal 
with respect to the real-world 3-D anatomical structures
under study. Volume rendering provides the
remedy: an ability to vary the viewpoint arbitrarily.

The Magnitude of the Volume 
Rendering Problem 

Volume data sets tend to be very large, and their 
sizes increase rapidly as the resolution increases. 
The number of voxels in a volume data set increases 
in proportion to the cube of the linear resolution; 
for example, a 100 × 100 × 100 grid contains 1 million
points. A typical medical CT volume data set
has 512 × 512 × 64 (i.e., 2^24) sample points, or more
than 16 million voxels.

The large amount of volume data implies massive
processing requirements. For instance, all volume
rendering techniques involve resampling the volume
data at no fewer points than there are pixels in the
rendered image. Some of
the current volume rendering methods require 
multiple samplings of the volume data for each 
pixel. Minimizing the aliasing effects of resampling 
requires interpolation of values from voxels near 
the sample point. Trilinear interpolation, the sim- 
plest interpolation scheme that accounts for the 
variation of the data in all three dimensions, 
requires accessing eight voxels and performing 
about 14 additions and 7 multiplications for each 
point. For sampling on a single plane section of the
volume, the number of sample points is proportional
to n^2, where n is the typical resolution of the
resulting image. But for the volume rendering
methods that involve tracing through the volume,
such as several of the methods described in this
section, the number of sample points is proportional
to n^3, where n is the average resolution of the
volume data set.
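The operation count quoted above for trilinear interpolation can be checked against a minimal Python sketch. It assumes a volume stored as nested lists indexed volume[z][y][x] with at least two voxels along each axis; the function names are hypothetical, and the code is not the Kubota hardware implementation.

```python
def lerp(a, b, t):
    """Linear blend of a and b: two additions (one a subtraction), one multiplication."""
    return a + t * (b - a)


def trilinear(volume, x, y, z):
    """Estimate the value at fractional position (x, y, z) in volume[z][y][x]."""
    nz, ny, nx = len(volume), len(volume[0]), len(volume[0][0])
    # Lower corner of the enclosing cube, clamped so the +1 neighbors exist.
    x0 = max(0, min(int(x), nx - 2))
    y0 = max(0, min(int(y), ny - 2))
    z0 = max(0, min(int(z), nz - 2))
    fx, fy, fz = x - x0, y - y0, z - z0
    # Four blends along x use all eight corner voxels.
    c00 = lerp(volume[z0][y0][x0], volume[z0][y0][x0 + 1], fx)
    c10 = lerp(volume[z0][y0 + 1][x0], volume[z0][y0 + 1][x0 + 1], fx)
    c01 = lerp(volume[z0 + 1][y0][x0], volume[z0 + 1][y0][x0 + 1], fx)
    c11 = lerp(volume[z0 + 1][y0 + 1][x0], volume[z0 + 1][y0 + 1][x0 + 1], fx)
    # Two blends along y, then one along z: seven blends in total.
    c0 = lerp(c00, c10, fy)
    c1 = lerp(c01, c11, fy)
    return lerp(c0, c1, fz)


# A 2 x 2 x 2 test volume whose values vary linearly with position.
volume = [[[x + 10 * y + 100 * z for x in range(2)] for y in range(2)] for z in range(2)]
print(trilinear(volume, 0.5, 0.5, 0.5))
```

Seven blends at two additions and one multiplication each give the 14 additions and 7 multiplications per sample point cited in the text.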

The combined requirements of (1) high resolu- 
tion to achieve useful results, (2) trilinear interpo- 
lation for antialiasing, and (3) interactive response 
time imply that adequate processing power for vol- 
ume rendering must be measured in hundreds of 
millions of arithmetic operations per second. These 
operations are either floating-point operations or 
fixed-point operations with subvoxel precision.
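A back-of-the-envelope estimate shows how these requirements reach hundreds of millions of operations per second. The workload figures in the sketch below (a 512 × 512 image, 64 samples per ray, one frame per second) are illustrative assumptions, not measurements of any system; only the per-sample operation counts come from the text.

```python
# Arithmetic cost of one trilinear sample, from the figures quoted in the text.
ADDS_PER_SAMPLE = 14
MULS_PER_SAMPLE = 7


def ops_per_second(image_width, image_height, samples_per_ray, frames_per_second):
    """Estimate arithmetic operations per second for ray-cast volume rendering."""
    rays = image_width * image_height        # one ray per displayed pixel
    samples = rays * samples_per_ray         # resampling points per frame
    ops_per_frame = samples * (ADDS_PER_SAMPLE + MULS_PER_SAMPLE)
    return ops_per_frame * frames_per_second


# Assumed workload: 512 x 512 image, 64 samples per ray, 1 frame per second.
print(ops_per_second(512, 512, 64, 1))
```

Under these assumptions a single frame per second already costs about 350 million operations, and the figure grows linearly with the frame rate.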

Memory access bandwidth and the bandwidth of 
the other data communication paths in the system 
are further potential limits to volume rendering. 
The volume data size, the amount of processing 
needed for each volume-rendered frame, the inter- 
polation requirement of multiple accesses to each 
voxel for each frame, and the interactive require- 
ment of multiple frames per second all contribute 
to massive requirements for data path bandwidth. 

Kubota's architectural features address all these 
requirements. The large off-screen frame buffer
memory accommodates the large volume data sets. 
The highly parallel processing elements, up to 20 
pixel engines and 6 geometric transform engines, 
meet the demands for processing power. The 
Kubota accelerator has wide data paths and high 
bandwidth throughout. The pixel engines have 
short access paths and high aggregate memory 
bandwidth to the voxel storage and image display 
memories. With these architectural features, Kubota 
offers a level of hardware acceleration for volume 
rendering that is unique in the workstation world. 

Combining Volume Rendering and 
Geometry-based Rendering 
Most applications that produce volume data sets 
from sampling measurements also involve objects 
that are defined geometrically. Such applications
can frequently make good use of a facility for pro- 
ducing images using both kinds of initial data 
together. The most familiar examples come from 
the area of medical imaging, but other areas are also 
potential sources. 

In some cases, the goal is to use the volume data 
to derive a geometric description of a scanned 
object (such as the surface of an anatomical organ 
or tumor) from the medical image data that pro- 
vides a voxel representation. Such an activity bene- 
fits from checking the derivation of the geometry 
by rendering it (using ordinary geometry rendering 
methods) onto an image that also directly displays
the original volume data by means of volume
visualization methods.

In other cases, the 3-D scene contains geometric 
objects defined independently of the scanned data. 
In the simplest case, the geometric objects may be 
reference frames or fiducials, in the form of planes 
or wire-frame boxes. A more complex example is 
the design of a prosthesis, such as an artificial hip 
joint. The prosthesis must be built and machined to 
an exact fit with the patient's skeletal structure. The 
bone geometry is determined by CT scans and pre- 
sented as voxel data. The prosthesis is designed and 
manufactured using CAD/CAM methods, which are 
geometry based. The fit of the prosthetic device 
can be verified visually in a display system that com- 
bines the rendered geometry data from the CAD sys- 
tem with the voxel data from the medical scanning 
system. 

Other medical imaging applications that benefit 
from mixing geometry-based imagery with voxel- 
based imagery include surgical planning and radia- 
tion treatment planning. Information on the 
distribution of bones, blood vessels, and organs 
comes to the surgeon in the form of volume data 
sets from one of the 3-D medical imaging modes, 
whereas X-ray beam geometries, surgical instruments,
fracture planes, and incision lines are all
future objects or events that are usually described 
geometrically at the planning stage. (See Figure 5 
in the following section for an example from radia- 
tion treatment planning.) 

Among the other application areas that combine 
volume rendering and geometry-based rendering
techniques are seismic data analysis and industrial
inspection and testing. In seismic data analysis, the
imaging and volume data from sonic experiments
coexist with geographic/topographic data or well-log
data that can be described geometrically. In
industrial inspection and testing, displaying the test 
image data together with an image rendered from 
the geometry-based CAD model of a part may aid in 
improving the quality of the part. 

As shown in the following section, the Kubota 
accelerator subsystem is capable of supporting 
simultaneous presentation and merging of volume- 
rendered and geometry-based imagery. 

Volume Rendering Techniques 

This section describes the four volume rendering 
methods that have been implemented on the 
Kubota accelerator hardware: (1) multiplanar refor- 
matting, (2) isosurface rendering, (3) maximum 
intensity projection, and (4) ray sum. These meth- 
ods, which constitute an important subset of the 
volume rendering techniques currently in use, all 
employ the technique of casting a bundle of parallel 
rays into the volume and then resampling the vol- 
ume data along the rays. Figure 3 illustrates the ray 
casting for a single ray. The figure shows

■ A virtual view plane (or virtual screen), which
appears ruled into pixels as it will be mapped to
the display surface

■ The volume data set, which is oriented arbitrarily
(with respect to the view plane) and ruled to
indicate voxels

■ A single ray, which projects from one of the pixels
and intersects the volume box

■ Sample volume data points along the ray

Figure 3 Ray Casting for Volume Rendering

For clarity, Figure 3 shows the pixel and voxel
resolutions and the sample point density much lower
than they are in actual practice, where they are typically
an order of magnitude higher. Also, although each
pixel in the view plane has an associated ray, the
figure illustrates only one ray of the parallel bundle.

The sample points generally do not coincide 
with voxel positions, so the sample values must 
be estimated from the values of the nearby voxels. 
The simplest sampling method uses the value of 
the nearest voxel for each sample point. To obtain 
better accuracy and to avoid unwanted artifacts, 
however, the sample value should be obtained by 
interpolating from nearby voxels. The interpolation 
must be at least linear (in terms of the algebraic 
degree) but should also respect all three spatial 
dimensions. The standard voxel sampling method 
uses trilinear interpolation, which computes each 
sample value as a blend of the eight voxels at the 
corners of a cube that contains the sample point. 
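The simplest sampling scheme mentioned above, nearest-voxel sampling along a ray, can be sketched as follows. The volume layout volume[z][y][x], the function names, and the choice to return None for points outside the grid are assumptions made for this illustration.

```python
import math


def ray_sample_points(origin, direction, step, count):
    """Return `count` points spaced `step` apart along a ray from `origin`."""
    ox, oy, oz = origin
    dx, dy, dz = direction
    return [(ox + i * step * dx, oy + i * step * dy, oz + i * step * dz)
            for i in range(count)]


def nearest_voxel(volume, point):
    """Nearest-voxel sample of volume[z][y][x]; None if the point is outside."""
    # floor(c + 0.5) rounds halves upward consistently, unlike round().
    x, y, z = (math.floor(c + 0.5) for c in point)
    if 0 <= z < len(volume) and 0 <= y < len(volume[0]) and 0 <= x < len(volume[0][0]):
        return volume[z][y][x]
    return None


# A 2 x 2 x 2 test volume and one ray running along the x axis.
volume = [[[x + 10 * y + 100 * z for x in range(2)] for y in range(2)] for z in range(2)]
points = ray_sample_points((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 0.5, 4)
values = [nearest_voxel(volume, p) for p in points]
print(values)
```

Two neighboring sample points can fall on the same voxel, which produces the blocky artifacts that interpolation from several nearby voxels is meant to avoid.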

The color (or gray level) used to display the pixel 
depends on the volume data values sampled along 
the ray. The set of sample points and how the sam- 
pled values determine the pixel color define the 
volume rendering method. Each pixel in the image 
has an associated ray for which the sampling opera- 
tion must be performed. 

Multiplanar Reformatting 
One of the most direct means of giving the user 
views of volume data is through multiplanar refor- 
matting. This method consists of displaying the vol- 
ume data on one or more specified plane sections 
through the volume that do not need to be perpen- 
dicular to the viewing direction. As described pre- 
viously, a bundle of rays, with one ray for each pixel 
in the image and all rays parallel to the viewing 
direction, is cast into the volume. Each ray has sam- 
ple points where it intersects the specified section 
planes. A sample value determines the pixel color 
according to a defined color map. When there are 
multiple section planes, a depth buffer determines 
which section plane defines the color of each pixel. 
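The sample-point selection for multiplanar reformatting amounts to a ray-plane intersection for each section plane, followed by a depth comparison. The sketch below assumes that each plane is given by a point and a normal vector and that the nearest intersection wins; these representations and the function names are hypothetical, and the depth rule actually used may differ.

```python
def dot(a, b):
    """Dot product of two 3-D vectors given as sequences."""
    return sum(p * q for p, q in zip(a, b))


def plane_depth(origin, direction, plane_point, plane_normal):
    """Depth t at which origin + t * direction meets the plane, or None."""
    denom = dot(plane_normal, direction)
    if abs(denom) < 1e-12:
        return None  # the ray is parallel to the plane
    diff = [p - o for p, o in zip(plane_point, origin)]
    t = dot(plane_normal, diff) / denom
    return t if t >= 0 else None  # ignore planes behind the view plane


def nearest_section(origin, direction, planes):
    """Depth-buffer style choice: keep the plane hit at the smallest depth."""
    best = None
    for index, (point, normal) in enumerate(planes):
        t = plane_depth(origin, direction, point, normal)
        if t is not None and (best is None or t < best[0]):
            best = (t, index)
    return best


planes = [
    ((0.0, 0.0, 5.0), (0.0, 0.0, 1.0)),  # plane perpendicular to the view direction
    ((0.0, 0.0, 2.0), (1.0, 0.0, 1.0)),  # oblique plane
]
print(nearest_section((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), planes))
```

The returned depth locates the sample point on the ray; the volume value resampled there is then passed through the color map to give the pixel color.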

Multiplanar reformatting is most effective when 
the user has interactive control over the section 
planes. Although some users may want to be able 
to change both the position and the orientation of 
the section plane, the interactive variation can be
better understood if only one parameter is varied,
usually the position of the section plane along
a line perpendicular to it.

Multiplanar reformatting can be combined 
advantageously with isosurface rendering and with 
geometry-based surface rendering. Figure 4 displays 
an example of multiplanar reformatting together 
with isosurface rendering. 

Isosurface Rendering 

For a scalar field in 3-D space, the set of points on 
which the field value is a particular constant is called 
an isosurface or a level surface. A traditional method 
of visualizing a scalar field in 3-D space is to draw or 
display one or more isosurfaces using the rendering 
methods of geometry-based graphics. A volume data 
set represents a sampling of a scalar field. This set 
can be used to render the field's isosurfaces directly 
by a volume rendering method that resamples along 
rays projected from the virtual view plane. 

The isosurface rendering method can be particu- 
larly effective when the object volume has sharp 
transitions where the field value changes rapidly in 
a small region, such as the interface between soft 
tissue and bone in a CT data set. Choosing the level 
value anywhere between the typical small opac- 
ity value of the soft tissue and the typical large 
opacity value of the bone produces an isosurface 
that is an accurate model of the actual surface of the
bone. Choosing the level value between the very 
low opacity value of the air and the higher opacity 
value of the skin produces an isosurface that corre- 
sponds to the surface of the skin. 

Kubota's isosurface rendering technique deter- 
mines the visible isosurface by a depth buffer 
method. For each pixel in the image, the depth 
buffer records the estimated depth in the scene 
where the ray through the pixel first crosses the 
isosurface level value. Accurate determination of 
the threshold depth requires sampling each ray 
at many more points than with multiplanar refor- 
matting. (The multiplanar reformatting method 
requires each ray to be sampled only at the rela- 
tively few points of intersection with the specified 
section planes.) Isosurface rendering is a truly 3-D 
sampling computation. 
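The threshold depth for one ray can be estimated from the values resampled along it. The following sketch assumes uniformly spaced samples, a ray that starts in a low-valued region (such as air), and linear interpolation between the two samples that bracket the level; it illustrates the idea only and is not the Kubota algorithm.

```python
def first_crossing_depth(samples, level, step=1.0):
    """Depth at which the samples along a ray first reach `level`, or None."""
    if samples and samples[0] >= level:
        return 0.0  # the ray starts at or inside the surface
    for i in range(1, len(samples)):
        a, b = samples[i - 1], samples[i]
        if a < level <= b:
            # Interpolate linearly between the two bracketing samples.
            frac = (level - a) / (b - a)
            return (i - 1 + frac) * step
    return None  # the ray never reaches the level value


# Samples passing from air (0) through soft tissue (10) into bone (100).
print(first_crossing_depth([0.0, 0.0, 10.0, 100.0, 100.0], 55.0))
```

Repeating this for the ray through every pixel fills the depth buffer that represents the visible isosurface.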

Once determined, the surface representation in
the depth buffer must be shaded for display. That is,
colors must be assigned to all the pixels in a way
that makes the surface topography evident to the
viewer. The Kubota implementation uses a simple
lighting model for the surface reflectance, namely,
the standard model for diffuse reflection known as
Lambert's law. The controlling parameter of the
lighting computation is the surface normal vector,
which specifies the orientation in space of the
surface at a given point. The normal vector is simply
related to the gradient of the depth function, which
can be estimated numerically by differencing the
depth values of neighboring pixels in both principal
directions.

Figure 4 Isosurface Rendering with Multiplanar Reformatting
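Shading a depth buffer with Lambert's law can be sketched as follows. The example assumes unit pixel spacing in the same units as depth, a viewer looking along the +z axis, and a light direction given as a unit vector pointing from the surface toward the light; the function name and these conventions are assumptions for illustration, not the accelerator's firmware.

```python
import math


def lambert_shade(depth, light=(0.0, 0.0, -1.0)):
    """Shade a depth buffer depth[y][x] with diffuse (Lambert) reflection."""
    h, w = len(depth), len(depth[0])
    image = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Central differences, falling back to one-sided at the borders.
            x0, x1 = max(x - 1, 0), min(x + 1, w - 1)
            y0, y1 = max(y - 1, 0), min(y + 1, h - 1)
            gx = (depth[y][x1] - depth[y][x0]) / (x1 - x0) if x1 > x0 else 0.0
            gy = (depth[y1][x] - depth[y0][x]) / (y1 - y0) if y1 > y0 else 0.0
            # Unit normal of the surface z = depth(x, y), facing the viewer.
            norm = math.sqrt(gx * gx + gy * gy + 1.0)
            n = (gx / norm, gy / norm, -1.0 / norm)
            # Lambert's law: intensity is the cosine of the angle to the light.
            image[y][x] = max(0.0, sum(a * b for a, b in zip(n, light)))
    return image


flat = [[5.0] * 3 for _ in range(3)]
ramp = [[float(x) for x in range(3)] for _ in range(3)]
print(lambert_shade(flat)[1][1], lambert_shade(ramp)[1][1])
```

With the light at the viewer's position, a surface facing the viewer receives full intensity, and intensity falls off as the surface tilts away.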

As a further aid to visualizing the shape of the iso- 
surface in 3-D space, the method can apply a depth- 
cueing interpolation to the final pixel colors. For 
each pixel, the color determined by the shading is 
blended with a fixed depth-cue color (typically the 
background color), with the proportions of the 
blend dependent on the depth of the surface point 
in the scene. This interpolation simulates the gen- 
eral fact that more distant objects appear dimmer 
and thus can add 3-D intelligibility to the image. 
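A depth-cueing blend of this kind can be sketched in Python as shown below. The linear ramp between a near depth and a far depth is an assumption; the text specifies only that the proportions of the blend depend on depth. The function and parameter names are hypothetical.

```python
def depth_cue(color, depth, near, far, cue_color=(0.0, 0.0, 0.0)):
    """Blend a shaded color toward cue_color as depth goes from near to far."""
    # Weight of the depth-cue color: 0 at the near depth, 1 at the far depth.
    w = (depth - near) / (far - near)
    w = min(1.0, max(0.0, w))
    return tuple((1.0 - w) * c + w * q for c, q in zip(color, cue_color))


# A surface point halfway between near and far keeps half of its shaded color.
print(depth_cue((1.0, 0.5, 0.0), 5.0, 0.0, 10.0))
```

With a black depth-cue color, distant surface points appear dimmer, which matches the effect described above.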

Isosurface rendering can be combined with mul- 
tiplanar reformatting by using a depth buffer to 
control the merging of the two images. An effective 
application of this capability to CT or MR data is to 
use isosurface rendering to display the outer surface
(skin) and a moving section plane to display the
interior data. The intersection of the section plane
with the skin surface provides a good reference
frame for the section data. Figure 4 shows an example
of such an image. Note that in this image, the
pixel value shown at each position comes from the
surface farther from the viewer, whereas the usual
depth comparison used in geometry-based rendering
shows the pixel value from the nearer surface.
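The depth-controlled merge of two images can be sketched as a comparison at each pixel. In the sketch below, the images and depth buffers are flat lists of equal length, and the keep_nearer flag selects between the usual nearer-surface rule and the farther-surface rule described above; the function and its parameters are hypothetical.

```python
def depth_merge(color_a, depth_a, color_b, depth_b, keep_nearer=True):
    """Merge two images pixel by pixel using their depth buffers."""
    merged_color, merged_depth = [], []
    for ca, da, cb, db in zip(color_a, depth_a, color_b, depth_b):
        # Choose image A when its depth wins the comparison (ties go to A).
        a_wins = da <= db if keep_nearer else da >= db
        merged_color.append(ca if a_wins else cb)
        merged_depth.append(da if a_wins else db)
    return merged_color, merged_depth


skin, skin_depth = ["skin"] * 3, [1.0, 4.0, 2.0]
section, section_depth = ["section"] * 3, [3.0, 3.0, 3.0]
print(depth_merge(skin, skin_depth, section, section_depth))
print(depth_merge(skin, skin_depth, section, section_depth, keep_nearer=False))
```

The merged depth buffer is returned along with the merged image so that a further image, such as a geometry-based rendering, can be merged in the same way.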

Volumetric isosurface rendering and multiplanar 
reformatting can also be combined with geometry- 
based rendering using the depth buffer to merge 
the image data. Figure 5 shows an example relevant 
to radiation treatment planning. The surface of the 
head and the plane section are produced by volume 
rendering methods applied to a CT volume data set. 
The surface with several lobes is a radiation dosage 
isosurface described geometrically on the basis of 
the source geometry. 

Maximum Intensity Projection 
In maximum intensity projection, the color of each 
pixel is assigned according to the maximum of the 
voxel values found along its ray. This type of volume
rendering is most useful in generating angiograms
(or projection maps of the human vasculature) from
MRI data sets. Special MR acquisitions are performed
in which the signal from flowing blood (or from
the hydrogen nuclei in the flowing blood) is more
intense than that from the surrounding tissue.

To locate the maximum value, each ray must 
be sampled at many points along its entire inter- 
section with the voxel volume. Thus, maximum 
intensity projection is also a truly 3-D resampling 
computation that requires Kubota's parallel pro- 
cessing capability. To be at least as accurate as the 
original volume data set, the sampling frequency 
should be comparable to the voxel resolution. If the 
sampling frequency is much lower, there is a risk
of large error in identifying and estimating the max- 
imum value. On the other hand, using a sampling 
frequency that is much higher than the voxel reso- 
lution considerably increases the cost of the method 
and yields little benefit. 
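The effect of the sampling frequency on maximum intensity projection can be seen in a small Python sketch. It assumes that the values along each ray have already been resampled into a list; the function names and the sample profile are hypothetical.

```python
def max_intensity(samples):
    """Maximum intensity projection for one ray: the largest sample value."""
    return max(samples)


def mip_image(rays):
    """Project an image, where rays[row][col] lists the samples along that ray."""
    return [[max_intensity(samples) for samples in row] for row in rays]


# One ray sampled at the voxel resolution, with a narrow bright vessel at index 2.
profile = [0, 2, 9, 2, 0, 1, 3, 1]
coarse = profile[::3]  # the same ray sampled at one third of the frequency

print(max_intensity(profile), max_intensity(coarse))
print(mip_image([[profile, coarse]]))
```

The coarse sampling steps over the narrow peak and underestimates the maximum, which is the error the text warns about when the sampling frequency falls well below the voxel resolution.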

Ray Sum 

In ray sum processing, rays are cast into a volume 
from a user-specified orientation, and intensities 
are accumulated from interpolated samples along
the ray. The projected image produced by ray sum is
the digital equivalent of an X-ray image generated 
from a volume of CT data. The ray sum technique 
permits an analyst to generate an X-ray-like image 
along an arbitrary direction for which directly 
obtained X-rays cannot produce a high-quality 
image. For instance, X-rays projected along a direc- 
tion parallel to the spinal column of a human gener- 
ally produce images of limited diagnostic value 
because too much matter is traversed between 
emission and absorption. Performing a ray sum
operation on a reduced CT volume that contains 
only the tissue of interest results in a high-quality, 
clear image of the tissue structure. 
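Ray sum accumulation can be sketched in the same style. The example assumes that the interpolated samples along each ray are already available as a list and that a reduced volume corresponds to a contiguous sub-range of samples along every ray; both the function names and that simplification are assumptions made for illustration.

```python
def ray_sum(samples, step=1.0):
    """Accumulate intensities along one ray, scaled by the sample spacing."""
    return step * sum(samples)


def ray_sum_image(rays, step=1.0, first=0, last=None):
    """Ray sum image; first:last restricts each ray to a sub-range of samples."""
    return [[ray_sum(samples[first:last], step) for samples in row] for row in rays]


# One row of two rays, each with four interpolated samples.
rays = [[[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 5.0, 0.0]]]
print(ray_sum_image(rays))
print(ray_sum_image(rays, first=1, last=3))
```

Restricting the range of samples removes the contribution of intervening matter, which is how a reduced volume yields a clearer image of the tissue of interest.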

Kubota Volume Rendering 
Implementation 

The Kubota 3D imaging and graphics accelerator 
offers unique capabilities for hardware support 
of interactive volume visualization techniques in a 
general graphics workstation context. It is the first 
system on a desktop scale to provide useful volume 
rendering in interactive time. Moreover, the Kubota 
accelerator is unique among specialized volume 
rendering systems in its capability for combining 
volume rendering and geometry-based rendering to
produce images. (See Denali Technical Overview 
for more details on the rendering process and the 
architecture of the Kubota 3D imaging and graphics 
accelerator.) 1 

The power of the Kubota accelerator for volume 
rendering stems from 

■ A large off-screen frame buffer memory, which is 
used for volume data storage in volume rendering 

■ The parallel processing power of the pixel 
engines (PEs) and the transform engines (TEs)

■ High-bandwidth data paths throughout 

In particular, the short, wide data paths that con- 
nect the PEs to the large local memory on the frame 
buffer modules (FBMs) are important in enabling the 
resampling and interpolation of the voxel values, 
which is the most costly part of volume rendering. 

The volume rendering implementation on the 
Kubota accelerator uses some of the same archi- 
tectural elements that support the high-quality 
geometry rendering features. The resampling and 
interpolation functionality is similar to the support 
for 3-D texture mapping. Some volume rendering 
operations also use the geometry processing func- 
tionality of the TEs and the scanning and incremen- 
tal interpolation functionality of the linear evaluator 
arrays on the TE modules. Several methods use 
depth merging to combine planar sections or to 
combine volume rendering with geometry render- 
ing. The implementation of these methods exploits 
the depth-buffering and depth-compare features of 
the FBMs and the PEs. 

Memory Usage and Volume Tiling 
Volume rendering requires fast memory to handle 
the volume data set and the displayed image. All the 
methods discussed in this paper also require fast 
memory for intermediate results, principally the 
projected subimages computed in the first stage 
(described later in this section). The methods that 
use depth merging also require a fast depth buffer. 
The volume data set itself, the intermediate results, 
and the depth buffer all use the off-screen frame 
buffer memory (dynamic random-access memory 
[DRAM]) in the FBM draw buffers. The displayed 
image, which is the end result of the volume ren- 
dering operations, resides in the on-screen frame 
buffer memory (video random-access memory 
[VRAM]) in the FBM display buffer, which is used to 
refresh the display. 



The Kubota accelerator offers two draw buffer 
memory configurations: 2M bytes (MB) per FBM and 
4MB per FBM. There can be 5, 10, or 20 FBMs. Thus, 
the memory available for volume data, intermediate 
results, and depth buffer can range from 10MB to 
80MB. As a rule of thumb, about half this memory 
can be used for the volume data set, and half is 
needed for intermediate results and depth buffer. 
Therefore, the largest configuration has 40MB of 
fast memory for volume data — enough to store 
the volume data sets of a wide range of potential 
applications. 
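
The memory arithmetic above can be checked with a short Python sketch. The function names are hypothetical, and the division by two simply encodes the rule of thumb stated in the text.

```python
MB_PER_FBM = (2, 4)        # draw buffer sizes per FBM, in megabytes
FBM_COUNTS = (5, 10, 20)   # supported numbers of FBMs

def draw_buffer_mb(fbm_count, mb_per_fbm):
    """Total off-screen (DRAM) draw buffer memory in megabytes."""
    return fbm_count * mb_per_fbm

def volume_data_mb(fbm_count, mb_per_fbm):
    """Rule of thumb: about half of the draw buffer holds volume data."""
    return draw_buffer_mb(fbm_count, mb_per_fbm) / 2

# All distinct totals across the supported configurations.
totals = sorted({draw_buffer_mb(n, m) for n in FBM_COUNTS for m in MB_PER_FBM})
largest_volume_mb = volume_data_mb(20, 4)
```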

Of course, the volume data must be distributed 
among the FBMs. To benefit from the Kubota archi- 
tectural features, the volume must be partitioned so 
that most of the data flow in trilinear interpolation 
is within FBMs rather than between them. Trilinear 
interpolation is a local 3-D operation, that is, its 
computation involves combining data from each 
voxel with its neighbors in all three dimensions. 
Therefore, the volume data must be partitioned into 
approximately cubical, contiguous 3-D subvolumes. 
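
To make the locality argument concrete, the following Python sketch shows a textbook trilinear interpolation. It is not the Kubota PE implementation; the volume[z][y][x] indexing and the clamping at the upper boundary are assumptions. Each sample reads the eight voxels at the corners of one cell, which is why neighboring voxels should reside in the same FBM.

```python
def _lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def trilinear(volume, x, y, z):
    """Interpolate volume[z][y][x] at a fractional, non-negative position
    inside the volume (hypothetical helper)."""
    # Clamp the cell origin so that the +1 neighbors stay inside the volume.
    x0 = min(int(x), len(volume[0][0]) - 2)
    y0 = min(int(y), len(volume[0]) - 2)
    z0 = min(int(z), len(volume) - 2)
    fx, fy, fz = x - x0, y - y0, z - z0
    # Four interpolations along x, two along y, one along z.
    c00 = _lerp(volume[z0][y0][x0], volume[z0][y0][x0 + 1], fx)
    c10 = _lerp(volume[z0][y0 + 1][x0], volume[z0][y0 + 1][x0 + 1], fx)
    c01 = _lerp(volume[z0 + 1][y0][x0], volume[z0 + 1][y0][x0 + 1], fx)
    c11 = _lerp(volume[z0 + 1][y0 + 1][x0], volume[z0 + 1][y0 + 1][x0 + 1], fx)
    c0 = _lerp(c00, c10, fy)
    c1 = _lerp(c01, c11, fy)
    return _lerp(c0, c1, fz)

# A single cell whose voxel values equal x + 2*y + 4*z.
cube = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
center = trilinear(cube, 0.5, 0.5, 0.5)
```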

The storage and accessing of the subvolumes use 
the same mechanisms as the 3-D texture-mapping 
capabilities. In the texture-mapping case, each FBM 
contains a copy of the same texture, which can 
have 64 X 64 X 64 four-byte texture elements. For 
volume rendering, however, each FBM contains a 
different subvolume of the volume data set being 
rendered. Moreover, the Kubota volume rendering 
implementation treats only single-channel volume 
data, which can have either 1 or 2 bytes per channel. 
Therefore, each 64 X 64 X 64 block of 4-byte tex- 
ture elements can store four 64 X 64 X 64 blocks of 
single-precision volume data or two 64 X 64 X 64 
blocks of double-precision volume data. Thus, an 
FBM with 4MB of DRAM can have eight 64 X 64 X 64 
single-precision blocks or four 64 X 64 X 64 double- 
precision blocks. 
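
The block counts above can be reproduced with the following Python sketch. The names are hypothetical, and the sketch assumes that half of each FBM's DRAM is available for volume data, following the rule of thumb given earlier in this section.

```python
BLOCK_EDGE = 64    # voxels per edge of a texture block
TEXEL_BYTES = 4    # bytes per texture element

def volume_blocks_per_texture_block(bytes_per_voxel):
    """Single-channel volume blocks that fit in one 4-byte texture block."""
    return TEXEL_BYTES // bytes_per_voxel

def volume_blocks_per_fbm(draw_buffer_mb, bytes_per_voxel):
    """Volume blocks per FBM, assuming half of the DRAM holds volume data."""
    texture_block_bytes = BLOCK_EDGE ** 3 * TEXEL_BYTES   # 1 MB
    usable_bytes = draw_buffer_mb * 2 ** 20 // 2
    texture_blocks = usable_bytes // texture_block_bytes
    return texture_blocks * volume_blocks_per_texture_block(bytes_per_voxel)

single_precision_blocks = volume_blocks_per_fbm(4, 1)   # 1 byte per voxel
double_precision_blocks = volume_blocks_per_fbm(4, 2)   # 2 bytes per voxel
```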

When the volume data set is smaller than the 
maximum that the configuration supports, the sub-
volume blocks are smaller than 64 X 64 X 64 and
remain geometrically congruent and evenly distrib- 
uted among the FBMs to maintain full parallelism. 

To preserve locality at the edges of the subvol- 
umes, the subvolume blocks are not completely dis- 
joint; adjacent blocks overlap by one voxel slice. 
Because of the overlap and another constraint 
related to the way FBMs are grouped by scan line, 
the maximum size of the volume data that can be 
accommodated is slightly less than 4 X 64^3 bytes per
FBM. A maximal Kubota accelerator configuration, 
with 80MB of draw buffers, can accommodate single-
precision volume data sets up to 256 X 256 X 505 or 
512 X 512 X 127 and double-precision sets up to 
256 X 256 X 253 or 512 X 512 X 64. Of course, con- 
figurations with fewer FBMs or smaller draw buffers 
accommodate proportionally smaller maximum 
volume data set sizes. In a single interactive study 
session, the volume data set needs to be down- 
loaded to the FBMs and partitioned into subvolumes 
only once. The set may be rendered many times 
under the control of an interactive user who is vary- 
ing viewing direction, sampling frequency, render- 
ing method, and other parameters. 

Kubota Volume Rendering Stages 
The fundamental operation on which all the Kubota 
volume rendering operations are based is the resam- 
pling and interpolation of the volume data on paral- 
lel projected rays, as illustrated in Figure 3. In the 
Kubota implementation, the PEs work in parallel, 
each on the sample points within the subvolume 
stored on its local FBM. Thus, the unit for parallel 
processing is the subvolume. Several different sam-
ple points of a single ray, lying in different sub-
volumes, may be computed simultaneously.

Each PE produces a projected subimage accord- 
ing to the volume rendering method in use, based 
on the PE's local subvolume. This subimage is also 
stored on the local FBM. Data packets from one TE 
control the processing, but the great volume of data 
traffic is all within FBMs. 

For each computed sample point on a projection 
ray, the PE updates the corresponding pixel of the 
subimage in a way that depends on the volume 
rendering method used. For isosurface rendering, 
the subimage is a depth buffer, which is updated 
subject to a depth comparison if the sample value 
exceeds the specified isosurface threshold value. For 
maximum intensity projection, the subimage is 
a voxel-value buffer, which is updated subject to a
voxel-value comparison. For multiplanar reformat- 
ting, the update also consists of updating a voxel- 
value buffer, subject to a depth comparison. For ray 
sum, the subimage is an accumulation of voxel val- 
ues multiplied by a constant. 
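
The per-sample update rules can be summarized in a Python sketch. This is an interpretation of the description above rather than the PE microcode; the pixel representation, the method names, and the convention that a smaller depth is nearer to the viewer are all assumptions.

```python
import math

def new_pixel():
    # Hypothetical subimage pixel. The initial value 0.0 assumes
    # non-negative voxel values; a smaller depth means nearer.
    return {"value": 0.0, "depth": math.inf}

def update_pixel(pixel, method, value, depth, threshold=0.0, scale=1.0):
    """Apply one interpolated sample to a subimage pixel."""
    if method == "isosurface":
        # Depth buffer: keep the nearest sample above the threshold.
        if value > threshold and depth < pixel["depth"]:
            pixel["depth"] = depth
    elif method == "max_intensity":
        # Voxel-value buffer updated subject to a value comparison.
        if value > pixel["value"]:
            pixel["value"] = value
    elif method == "multiplanar":
        # Voxel-value buffer updated subject to a depth comparison.
        if depth < pixel["depth"]:
            pixel["depth"] = depth
            pixel["value"] = value
    elif method == "ray_sum":
        # Accumulate voxel values multiplied by a constant.
        pixel["value"] += scale * value
    else:
        raise ValueError(f"unknown method: {method}")
    return pixel

def project_ray(method, samples, **params):
    """Fold all (value, depth) samples of one ray into a fresh pixel."""
    pixel = new_pixel()
    for value, depth in samples:
        update_pixel(pixel, method, value, depth, **params)
    return pixel

samples = [(5.0, 0.0), (50.0, 1.0), (80.0, 2.0), (20.0, 3.0)]
```

For the four samples shown, isosurface rendering with a threshold of 40 records depth 1.0, maximum intensity projection records the value 80.0, and ray sum with a constant of 0.5 accumulates 77.5.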

The result of the parallel projection stage is a set 
of subimage tiles in the FBM draw buffers, with each 
tile representing a part of the projected image of 
the whole volume data set. Of course, the different 
image tiles represent overlapping portions of the 
image in screen space and are not yet stored with 
correctly interleaved addresses. The next volume
rendering stage recombines the subimage tiles to
form the whole image and redistributes the pixels 
correctly to the interleaved addresses. The recom- 
bination stage involves reading back the tiled 
subimage data to the TE modules, scan line by scan 
line, and then writing the data back to the FBMs. 
The write-back operation applies value compari- 
son in each rendering mode. 
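
The recombination can be sketched in Python as a merge of overlapping tiles into one image. The tile representation and the choice of comparison function are assumptions for illustration (for example, max for a voxel-value buffer in maximum intensity projection and min for a depth buffer); the sketch does not model the scan-line read-back or the interleaved addressing.

```python
def recombine(tiles, width, height, keep):
    """Merge subimage tiles into a single image.

    Each tile is (x0, y0, rows), where rows is a list of pixel rows and
    (x0, y0) is the tile origin in screen space. `keep(a, b)` decides
    which of two overlapping pixel values survives."""
    image = [[None] * width for _ in range(height)]
    for x0, y0, rows in tiles:
        for dy, row in enumerate(rows):
            for dx, value in enumerate(row):
                current = image[y0 + dy][x0 + dx]
                if current is None:
                    image[y0 + dy][x0 + dx] = value
                else:
                    image[y0 + dy][x0 + dx] = keep(current, value)
    return image

# Two one-row tiles that overlap in the middle pixel.
tiles = [(0, 0, [[5, 9]]), (1, 0, [[7, 3]])]
by_value = recombine(tiles, 3, 1, max)
by_depth = recombine(tiles, 3, 1, min)
```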

Further processing stages are possible. The 
projected image resolution that determined two 
dimensions of the sampling frequency in the pro- 
jection stage may not correspond to the desired 
screen image size. Thus, the next stage might be a 
2-D zoom operation to produce an image of the 
desired size. This stage is implemented in TE mod- 
ule code with input coming from the stored image 
of the recombination stage. The 2-D zoom can use 
point sampling or bilinear interpolation, depend- 
ing on the sampling chosen for the projection 
stage. 

The isosurface rendering method requires a shad- 
ing stage that involves another read-back cycle. This 
cycle computes the normal vectors by differencing 
the depth values and applies the depth-gradient 
shading and the depth cueing interpolations. This 
shading stage uses the ordinary geometry-based 
rendering support provided by the TE modules. 
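
The derivation of normal vectors from a depth buffer can be sketched as follows. This Python example uses central differences on interior pixels and treats the depth buffer as a height field z = depth(x, y); the sign convention of the resulting normal and the function name are assumptions, not details taken from the TE module code.

```python
import math

def normal_from_depth(depth, x, y):
    """Unit normal at interior pixel (x, y) of a depth buffer indexed as
    depth[y][x], obtained by differencing neighboring depth values."""
    dzdx = (depth[y][x + 1] - depth[y][x - 1]) / 2.0
    dzdy = (depth[y + 1][x] - depth[y - 1][x]) / 2.0
    length = math.sqrt(dzdx * dzdx + dzdy * dzdy + 1.0)
    return (-dzdx / length, -dzdy / length, 1.0 / length)

# A surface facing the viewer, and one whose depth increases with x.
flat = [[2.0, 2.0, 2.0] for _ in range(3)]
tilted = [[0.0, 1.0, 2.0] for _ in range(3)]
flat_normal = normal_from_depth(flat, 1, 1)
tilted_normal = normal_from_depth(tilted, 1, 1)
```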

Finally, a further image merging stage may be 
used to combine the rendered isosurface with an 
image produced by multiplanar reformatting, using 
depth comparisons. To show a slice through a vol- 
ume bounded by an isosurface, the depth compari- 
son may show the pixel from the deeper surface 
rather than from the nearer surface, as is usually the 
case in geometry-based rendering. 

All stages subsequent to the projection stage 
involve 2-D computations and so represent a small 
amount of computational work relative to the mas- 
sive computation of the 3-D projection stage. 

Performance and Speed/Resolution 
Trade-offs 

A meaningful low-level volume rendering perfor- 
mance metric is trilinear interpolations per second 
(TRIPS). Most of the computational work in the 
expensive projection stage is for performing trilin- 
ear interpolations. The measured performance of 
the Kubota accelerator in this metric on 8-bit voxel 
data is 600,000 TRIPS per PE. As expected, this
metric scales linearly with the number of PEs, so 
a 20-FBM configuration can achieve 12 million TRIPS. 
The corresponding measured performance on 16-bit 
voxel data is 475,000 TRIPS per PE. A 20-FBM config-
uration can achieve 9.5 million TRIPS.
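
The linear scaling can be written out as a short Python sketch. It assumes one PE per FBM, which is implied by the description of each PE working on its local FBM; the names are hypothetical. The per-PE figures are 600,000 TRIPS for 8-bit data and 475,000 TRIPS for 16-bit data.

```python
TRIPS_PER_PE = {8: 600_000, 16: 475_000}   # keyed by voxel depth in bits

def aggregate_trips(pe_count, voxel_bits):
    """Aggregate trilinear interpolations per second, assuming the
    per-PE rate scales linearly with the number of PEs."""
    return pe_count * TRIPS_PER_PE[voxel_bits]

trips_8bit = aggregate_trips(20, 8)
trips_16bit = aggregate_trips(20, 16)
```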

Currently, there are no recognized benchmarks 
to use as high-level volume rendering performance 
metrics. Practical tests can be expressed in terms of 
the size of the volume data sets that can be ren- 
dered with good interactive frame rates. Of course, 
the rendering speed depends strongly on the ren- 
dering parameters that affect quality, particularly 
the 3-D sampling frequency. 

The ability to interactively change the rendering
parameters supports interactive use. For example, a
considerable amount of the interaction typically 
consists of rotating the volume model about one or 
more axes (with respect to the view direction) and 
then stopping in a particular position to carefully 
examine the image. An application can set the sam-
pling frequency to a coarser value while rotating,
to get smooth motion (e.g., 10 frames per second)
with less accurate images, and then rerender
the data with a finer sampling frequency to show 
a more accurate image when the user stops in the 
desired position. 
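
The speed/resolution trade-off can be estimated with a rough Python sketch. It is an upper bound under stated assumptions: only the trilinear interpolations of the projection stage are counted, the 2-D stages and all overhead are ignored, and the function names are hypothetical. The rate of 12 million TRIPS is the figure for a 20-FBM configuration on 8-bit data.

```python
def samples_per_frame(trips, frames_per_second):
    """Number of trilinear interpolations available for each frame."""
    return trips // frames_per_second

def cubic_sampling_edge(trips, frames_per_second):
    """Largest n such that an n x n x n sampling grid fits the budget."""
    budget = samples_per_frame(trips, frames_per_second)
    n = round(budget ** (1.0 / 3.0))
    # Correct for floating-point error in the cube root.
    while n ** 3 > budget:
        n -= 1
    while (n + 1) ** 3 <= budget:
        n += 1
    return n

rotating_edge = cubic_sampling_edge(12_000_000, 10)   # coarse, while rotating
stopped_edge = cubic_sampling_edge(12_000_000, 1)     # finer, when stopped
```

Under these assumptions, a sampling grid of about 106 points per edge sustains 10 frames per second, and a grid of about 228 points per edge can be rerendered in roughly one second once the user stops rotating.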

Software Interface 

The fundamental firmware routines that implement 
the Kubota volume rendering capability are accessi- 
ble through an application programming interface. 
This interface permits users to perform volume 
rendering in a windows environment like the 
X Window System. The interface includes routines 
to manage image memory for volume rendering, to 
download and manipulate volume data sets, and 
to produce screen images by the volume rendering 
methods discussed in this paper — all while effi- 
ciently exploiting the parallel processing capabili- 
ties of the Kubota accelerator. 

Appendix: Conceptual Review 

The application of computing systems and compu- 
tational methods to produce and manipulate 
images and pictures has historically involved two 
different kinds of data structures: geometry-based 
models and digital images. The body of this paper 
concerns a third kind of data structure, the volume 
data set, which has more recently become impor- 
tant in imaging applications. This Appendix seeks to 
clarify the natures of digital images and geometry- 
based models as a basis for the discussion of their 
roles in volume rendering. It reviews the principal 
concepts, data structures, and operations of com-
puter graphics and image processing. The review is
intended for the interested reader who may not be
well versed in the subject. It is also intended to clar- 
ify for all readers the meanings of the terms used in 
the paper. 

Pixels, Digital Images, and 
Image Processing

A digital image is simply a two-dimensional (2-D) 
array of data elements that represent color values or 
gray values taken at a set of sample points laid 
out on a regular grid over a plane area. The data 
elements of a digital image are commonly called 
pixels, a contraction of picture elements. A digital 
image can be obtained by scanning and sampling 
a real image, such as a photograph, or by capturing 
and digitizing a real-time 2-D signal, such as the 
output of a video camera. A digital image can be 
displayed on a raster output device. The raster 
device most commonly used in interactive work is 
a cathode-ray tube (CRT). The CRT is refreshed by 
repeated scanning in a uniform pattern of parallel 
scan lines (the raster), modulated by the informa- 
tion in a digital image contained in a frame buffer 
memory. Such a display will be a more-or-less faith- 
ful copy of the original image depending on the val- 
ues of two parameters: the resolution or sampling 
frequency and the pixel depth, which is the preci- 
sion with which the pixel values are quantized in 
the digital representation. In the context of a raster 
display, the pixels are regarded as representing 
small rectangular areas of the image, rather than 
as mathematical points without extent. 

Image processing involves the manipulation of 
digital images produced from real images, e.g., pho- 
tographs and other scanned image data. Image 
processing applications may have several different 
kinds of objectives. One set of objectives concerns 
image enhancement, i.e., producing images that 
are in some sense better or more useful than the 
images that come from the scanning hardware. 

Some image processing applications, which can 
be characterized as image understanding, have the 
objective of extracting from the pixel data higher- 
level information about what makes up the image. 
The simplest of these applications classify the pixels 
in an image according to the pixel values. More 
sophisticated image understanding applications can 
include detection and classification of the objects, 
for example, in terms of their geometry. The term 
computer vision is also used for image understand- 
ing applications that strive for automatic extraction 
of high-level information from digital images. 

Geometry-based Models and
Computer Graphics 

Geometry-based models are data structures that 
incorporate descriptions of objects and scenes in 
terms of geometric properties, e.g., shape, size, posi- 
tion, and orientation. The term computer graphics 
generally refers to the activity of synthesizing 
pictures from geometry-based models. The pro- 
cess of synthesizing pictures from models is called 
rendering. 

The fundamental elements of the geometry- 
based models (frequently called primitives) are 
mathematical abstractions — typically points, lines, 
curves, polygons, and other surfaces. The graphics 
application usually defines certain objects made 
from primitives and assembles the objects into 
scenes to be rendered. Usually, the geometry-based 
models used in graphics contain additional data 
that describes graphical attributes and physical 
properties beyond the geometry of the displayed 
objects. Examples of these attributes and proper- 
ties are surface color, the placement and colors of 
light sources, and the parameters that characterize 
how materials interact with light. 

Applications use geometry-based models for 
purposes beyond producing graphics. Models are 
essential for analytical studies of objects, such as 
determination of structural or thermal properties, 
and for supporting automated manufacturing by 
computer-controlled machine tools. 

In earlier eras, instead of raster devices, com- 
puter graphics systems used so-called stroke or vec- 
tor graphics output devices. These devices were 
directly driven by geometric descriptions of pic- 
tures, rather than by digital images. The most famil- 
iar vector systems were the pen plotter and the 
CRT display operated in a calligraphic rather than 
a raster mode. Such stroke devices were driven 
by data structures called display lists, which were 
the forerunner of today's sophisticated three- 
dimensional (3-D) geometry-based models. 

Digital Images and 
Geometry-based Models 

We normally think of a digital image as a data struc- 
ture that is of a lower level than a geometry-based 
model because the data contains no explicit infor- 
mation about the geometry, the physical nature, or 
the organization of the objects that may be pic- 
tured. The data tells how the colors or gray values 
are distributed over the plane of the image but not 
how the color distribution may have been pro-
duced by light reflected from objects. On the other
hand, the digital image is a more generally applica- 
ble data structure than the geometry-based model 
and therefore may be used in applications that have 
no defined geometric objects. 

Historically, image processing and geometry-
based computer graphics have been distinct activi-
ties, performed by different people using different 
software and specialized hardware for different 
purposes. Recently, however, beginning with the 
advent of raster graphics systems, the distinction 
has become blurred as each discipline adopts tech- 
niques of the other. 

For example, high-quality computer graphics 
uses image processing techniques in texture map- 
ping, which combines digital images of textures 
with geometric surface descriptions to produce 
more realistic-looking or more interesting images 
of a surface. A good example of texture mapping is 
the application of a scanned image of a wood grain 
to a geometrically described surface to produce a 
picture of a wooden object. Because of its fine-scale 
detail, geometric modeling of the wood grain is 
impractical. More generally, one major problem of 
raster graphics is aliasing, which is the appearance 
of unwanted artifacts due to the finite sampling fre-
quency in the raster. Some techniques now used in
raster graphics to ameliorate the effects of aliasing 
artifacts are borrowed from image processing. 

Conversely, the image understanding applica- 
tions of image processing involve the derivation of 
geometric model information from given images. 
In other imaging application areas, one has infor- 
mation at the level of a geometric model for the 
same system that produced the image data, and 
there is naturally an interest in displaying the 
geometric modeling information and the imaging 
information in a single display. Thus, for example, a 
remote sensing application may want to combine 
earth images from satellite-borne scanning devices
with geographic map drawings, which are based on 
geometrical descriptions of natural and political 
boundaries. 

Dimensionality and Projection 
The digital image is intrinsically 2-D because real 
images, even before the sampling that produces 
digital images, are all 2-D. That is, they have only 
two dimensions of extent, whether they exist on 
sheets of paper, on photographic film, on work- 
station screens, or on the retinas of our eyes. (True 
3-D images exist in the form of holograms, but 
these are not yet generally available as computer
output devices, so we do not consider them further 
in this paper.) 

In some application areas, such as integrated- 
circuit design, many geometry-based models may 
be strictly 2-D. But since the world is 3-D, many 
engineering and scientific application areas today 
use 3-D geometry-based models. In these models, 
the points, lines, curves, and surfaces are all 
defined in a 3-D model space. Although lines and 
curves have one dimension of extent and surfaces 
have two dimensions of extent, in a 3-D applica- 
tion, these figures all lie in an ambient 3-D space, 
not all contained in any single plane in the model 
space. 

Hence, all 3-D visualization techniques, whether 
based on geometric models or based on the volume 
data sets discussed in this paper, use some kind of 
projection mapping from the 3-D model space to 
a 2-D view plane. The simplest kind of viewing 
projection, the one most frequently used in engi- 
neering graphics and in the volume rendering 
implementations described in this paper, is called 
orthographic projection. This projection is along a 
family of parallel lines to a plane that is perpendicu- 
lar to all of them (see Figure 3). The common direc- 
tion of the family of parallel projection lines is 
called the viewing direction. 
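
An orthographic projection can be sketched in Python as follows. The construction of the view-plane basis (including the helper axis used to avoid a degenerate cross product) and the function names are assumptions for illustration. The sketch demonstrates the defining property: points that differ only along the viewing direction project to the same 2-D position.

```python
import math

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(a):
    length = math.sqrt(_dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)

def view_plane_basis(view_direction):
    """Two orthonormal vectors spanning the plane that is perpendicular
    to the viewing direction."""
    d = _normalize(view_direction)
    # Pick a helper axis that is not nearly parallel to d.
    helper = (0.0, 0.0, 1.0) if abs(d[2]) < 0.9 else (1.0, 0.0, 0.0)
    u = _normalize(_cross(helper, d))
    v = _cross(d, u)
    return u, v

def project_orthographic(point, view_direction):
    """Map a 3-D point to 2-D view-plane coordinates."""
    u, v = view_plane_basis(view_direction)
    return (_dot(point, u), _dot(point, v))

direction = (1.0, 2.0, 2.0)
near = project_orthographic((1.0, 1.0, 1.0), direction)
far = project_orthographic((4.0, 7.0, 7.0), direction)   # near + 3 * direction
```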

The fact that the viewed images are 2-D poses a 
basic problem of 3-D graphics: How do you convey 
to the viewer a sense of the 3-D world by means of 
viewing 2-D images? An extremely important tech- 
nique for solving this problem is to give the viewer 
interactive control over the viewing projection. 
The ability to change the viewpoint and viewing 
direction at will is a great aid to understanding 
the 3-D situation from the projected 2-D image, 
whether the image is produced by rendering from 
3-D geometric models or by the volume rendering 
techniques discussed in this paper. 

Visualization by Pixels and Voxels 
The power of raster systems to display digital 
images vastly increases when we recognize certain 
aspects of data visualization. We can make digital 
images from data that are not intrinsically visual 
or optical and that do not originate from scanning 
real visible images or from rendering geometrical 
surfaces by using illumination and shading models. 
We can display virtually any 2-D spatial distribution 
of data by sampling it on a regular 2-D grid and 
mapping the sampled values to gray-scale values or
colors. By viewing the displayed image, a viewer
can gain insight into and understanding of the 
content of the 2-D data distribution. The term 
pseudocolor is used frequently to mean using col- 
ors to give visual representation to other kinds of 
data that have no intrinsic significance as color. 
This approach to data visualization provides a pow- 
erful tool for assimilating and interpreting 2-D 
spatially distributed data, in much the same way as 
geometry-based graphics have for centuries pro-
vided a powerful tool, graphing, for visualizing
quantitative relationships in all realms of analytical 
science. 
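
A pseudocolor mapping can be as simple as the following Python sketch. The blue-to-red ramp, the 8-bit color channels, and the function name are arbitrary choices made for illustration; practical applications typically use carefully designed color tables.

```python
def pseudocolor(value, lo, hi):
    """Map a scalar in [lo, hi] to an (r, g, b) triple with 8-bit channels
    on a blue-to-red ramp. Values outside the range are clamped."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    t = (value - lo) / (hi - lo)
    t = max(0.0, min(1.0, t))
    return (round(255 * t), 0, round(255 * (1.0 - t)))

# A small 2-D distribution of sampled data values in the range 0 to 100.
grid = [[0, 25], [100, 150]]
image = [[pseudocolor(v, 0, 100) for v in row] for row in grid]
```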

The most familiar examples of image renditions 
of data that are not intrinsically image data come 
from medical imaging. In ordinary X-ray imaging, 
real images are formed by exposing photographic 
film to X-radiation passing through the subject. 
However, the newer medical imaging modalities, 
such as CT scanning, MRI, and ultrasound, and the
techniques of nuclear medicine (PET and SPECT) use
various kinds of instrumentation to gather non- 
visual data distributed over plane regions. These 
procedures then use computer processing to cast 
the data into the form of digital images that can be 
displayed in pseudocolor (or pseudo gray scale) for 
viewing by the medical practitioner or researcher. 

Many other examples of data visualization by pix- 
els abound. For example, in satellite-borne remote 
sensing of the Earth's surface, scanners gather data 
in several different spectral bands of electromag- 
netic radiation, both visible and nonvisible. The 
user can glean the geophysical information by view- 
ing pseudocolor displays of the scanned informa- 
tion, usually after processing the information to 
classify surface regions according to criteria that 
involve combinations of several spectral values. 
Other examples come from the display of 2-D data 
distributions measured in the laboratory, as in fluid 
dynamics, or acquired in the field, as in geology. 

Volume data sets and voxels are natural general- 
izations of digital images and pixels. They represent 
data sampled on regular grids but in three dimen- 
sions instead of two. The idea of volume visualiza- 
tion or volume rendering extends to volume data 
sets the idea of using images to represent arbitrary 
2-D data distributions. Because the final viewed 
images are necessarily 2-D, however, volume render- 
ing is substantially more complicated than simple 
pseudocolor representation of 2-D data. Although 
volume rendering uses ideas similar to those used 
in 2-D image processing, such as the methods 
of resampling and interpolation, it also requires
techniques similar to those used in rendering 3-D 
geometric models, such as geometric transforma- 
tions and viewing projections. Thus, the Kubota 
3D imaging and graphics accelerator, which is 
designed to provide both image processing and 
3-D graphics, is especially well suited for volume 
rendering applications. 

Development of Digital's PCI Chip
Sets and Evaluation Kit for the 
DECchip 21064 Microprocessor 

The DECchip 21071 and the DECchip 21072 chip sets were designed to provide simple, 
competitive devices for building cost-focused or high-performance PCI-based sys- 
tems using the DECchip 21064 family of Alpha AXP microprocessors. The chip sets 
include data slices, a bridge between the DECchip 21064 microprocessor and the PCI 
local bus, and a secondary cache and memory controller. The EB64+ evaluation 
kit, a companion product, contains an example PC mother board that was built 
using the DECchip 21064 microprocessor, the DECchip 21072 chip set, and other 
off-the-shelf PC components. The EB64+ kit provides hooks for system designers
to evaluate cost/performance trade-offs. Either chip set, used with the EB64+ evalu- 
ation kit, enables system designers to develop Alpha AXP PCs with minimal design 
and engineering effort. 



The DECchip 21071 and the DECchip 21072 chip sets 
are two configurations of a core logic chip set for 
the DECchip 21064 family of Alpha AXP micropro-
cessors. 1 The core logic chip set provides a 32-bit
PCI local bus interface, cache/memory control 
functions, and all related data path functionality to 
the system designer. It requires minimal external 
logic. The EB64+ kit is an evaluation and develop- 
ment platform for computing systems based on the 
DECchip 21064 microprocessor and the core logic
chip set. The EB64+ kit also served as a debug plat-
form for the chip sets. The DECchip 21071 and the
DECchip 21072 chip sets and the EB64+ evaluation 
kit were developed to proliferate the Alpha AXP
architecture in the industry by providing system 
designers with a means to build a wide range of 
uniprocessor systems using the DECchip 21064 pro- 
cessor family with minimal design and engineering 
effort. 2 

The core logic chip set and the EB64+ evaluation 
kit were developed by two teams that worked 
closely together. This paper describes the goals of 
both projects, the major features of the products, 
and the design decisions of the development teams. 



The Core Logic Chip Set 

This section discusses the design and development 
of the two configurations of the core logic chip set. 
After presenting the project goals and the overview, 
the section describes partitioning alternatives and 
the PCI local bus interface. It then details the mem- 
ory controller and the cache controller and con- 
cludes with discussions of design considerations 
and functional verification. 

Project Goals 

The primary goal of the project was to develop a 
core logic chip set that would demonstrate the high 
performance of the DECchip 21064 microprocessor 
in desktop and desk-side systems with entry prices 
less than $4,000. The chip set had to be system inde- 
pendent and had to provide the system designer 
with the flexibility to build either a cost-focused 
system or a high-performance system. 

Another key goal was ease of system design. The 
chip set had to include all complex control func- 
tions and require minimal discrete logic on the 
module so that a system could be built using a 
personal computer (PC) mother board and off-the-
shelf components.

Time-to-market was a major factor during the 
development of the chip set. The DECchip 21064 
microprocessor had been announced nearly five 
months before we started to develop the core logic 
chip set. Digital wanted to proliferate the Alpha 
AXP architecture in the PC market segment; how- 
ever, the majority of system vendors required some 
core logic functions in conjunction with the micro- 
processor to aid them in designing systems quickly 
and with low engineering effort. Providing these 
interested system vendors with core logic chip set 
samples as soon as possible was very important 
to enable the DECchip 21064 microprocessor to 
succeed in the industry. 

To determine the feature set that would meet the 
project goals, we polled a number of potential chip 
set customers in the PC market segment to under- 
stand their needs and the relative importance of 
each feature. We kept this feedback in mind during 
the course of the design and made appropriate 
design decisions based on this data. The following 
subsections describe the final chip set partitioning, 
the trade-offs we had to make in the design, and the 
design process. 

Chip Set Overview

The chip set consists of three unique designs: 

■ DECchip 21071-BA data slice 

■ DECchip 21071-CA cache/memory controller 

■ DECchip 21071-DA PCI bridge 

It can be used in either a four-chip or a six-chip
configuration. 

The DECchip 21071 chip set consists of four 
chips: two data slices, one cache/memory con- 
troller, and one PCI bridge. This configuration was 
developed for a cost-focused system; it provides a 
128-bit path to secondary cache and a 64-bit path 
to memory. Cache and memory data have 32-bit 
parity protection. 

The DECchip 21072 chip set consists of six chips: 
four data slices, one cache/memory controller, and 
one PCI bridge. Intended for use in a performance- 
focused system, this configuration provides a 128-bit 
path to secondary cache and a 128-bit path to mem- 
ory. The system designer can choose between 
32-bit parity or 32-bit error correcting code (ECC) 
protection on cache and memory data. 
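
The two configurations can be summarized as data. The following Python sketch is only a restatement of the description above; the class and field names are hypothetical and do not correspond to any Digital software interface.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChipSetConfig:
    name: str
    data_slices: int          # DECchip 21071-BA
    cache_mem_controllers: int  # DECchip 21071-CA
    pci_bridges: int          # DECchip 21071-DA
    cache_path_bits: int
    memory_path_bits: int
    protection: tuple         # data protection options for cache and memory

    @property
    def chip_count(self):
        return self.data_slices + self.cache_mem_controllers + self.pci_bridges

CONFIGS = {
    "21071": ChipSetConfig("DECchip 21071", 2, 1, 1, 128, 64, ("parity",)),
    "21072": ChipSetConfig("DECchip 21072", 4, 1, 1, 128, 128, ("parity", "ecc")),
}

chip_counts = {key: cfg.chip_count for key, cfg in CONFIGS.items()}
```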



Figure 1 is a block diagram of an example system 
using the core logic chip set. For a list of compo- 
nents used in a typical system built with this chip 
set, see the EB64+ Kit Overview section.

The processor controls the secondary cache by 
default. It transfers ownership of the secondary 
cache to the cache controller when it encounters 
a read or a write that misses in the secondary cache. 
The cache controller is responsible for allocating 
the cache on CPU memory reads and writes, and for
extracting victims from the cache. The cache con- 
troller is also responsible for probing and invalidat- 
ing the secondary cache on direct memory access 
(DMA) transactions initiated by devices on the PCI 
local bus.

The ownership of the address bus, sysAdr, is 
shared by the processor and the PCI bridge. The 
processor is the default owner of sysAdr. When 
the PCI bridge needs to initiate a DMA transaction, 
the cache controller performs the arbitration. 

Data is transferred between the processor, the 
secondary cache, the data slices, and the cache/ 
memory controller over the sysData bus, which 
is 128 bits wide. In the 4-chip configuration, each 
of the two data slices connects to 64 bits of the 
sysData bus. In the 6-chip configuration, each of
the four data slices connects to only 32 bits of the 
sysData bus, leaving 32 data bits available for use 
as ECC check bits for memory and cache data. The 
cache/memory controller connects to the lower 
16 bits of the sysData bus to allow access to its con- 
trol and status registers (CSRs). 
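
The two configurations and their sysData partitioning can be summarized in a few lines of Python. This is only an illustrative sketch of the figures quoted above; the class and field names are assumptions introduced here and do not come from the chip set documentation.

```python
from dataclasses import dataclass

SYSDATA_BITS = 128  # width of the sysData bus, as stated in the text


@dataclass(frozen=True)
class ChipSetConfig:
    """Hypothetical summary record for one chip set configuration."""
    name: str
    data_slices: int
    memory_bits: int      # width of the memory data path
    protection: tuple     # protection options on cache and memory data

    @property
    def chip_count(self) -> int:
        # data slices + one cache/memory controller + one PCI bridge
        return self.data_slices + 2

    @property
    def sysdata_bits_per_slice(self) -> int:
        # the sysData bus is divided evenly among the data slices
        return SYSDATA_BITS // self.data_slices


CONFIGS = {
    "21071": ChipSetConfig("21071", data_slices=2, memory_bits=64,
                           protection=("parity",)),
    "21072": ChipSetConfig("21072", data_slices=4, memory_bits=128,
                           protection=("parity", "ecc")),
}

for cfg in CONFIGS.values():
    print(cfg.name, cfg.chip_count, "chips,",
          cfg.sysdata_bits_per_slice, "sysData bits per data slice")
```
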

Data transfers between the PCI and the proces- 
sor, the secondary cache, and memory take place 
through the PCI bridge and the data slices. The PCI 
bridge and the data slices communicate through 
the epiBus. The epiBus contains 32 bits of data 
(epiData), 4 byte enables, and the data path control
signals. We defined the epiBus control signals
so that the PCI bridge chip operation is independent
of the number of data slices in the system.
Furthermore, the epiBus control signal definitions 
allow the epiData bus width to be expanded to 64 
bits without changing the design of the data slice. 

The system designer can link the system to an 
expansion bus, such as the Industry Standard 
Architecture (ISA) bus or the Extended Industry 
Standard Architecture (EISA) bus, by using a
PCI-to-ISA bridge or a PCI-to-EISA bridge. The Intel
82378IB and 82375EB bridges, for example, are
available in the market for the ISA and the EISA buses,
respectively.
Figure 1 Core Logic Chip Set Configurations in a System Block Diagram



Partitioning Alternatives 
As a result of our customer visits, we found that the 
following features were important for cost-focused
systems. The features, which affect the partition- 
ing, are listed in descending order of importance. 

■ Low cost for the chip set 

■ Low chip count 

■ Parity protection on memory 

■ Inexpensive memory subsystem 

The following features were identified as impor- 
tant for performance-oriented, server-type systems 
(in descending order of importance). 

■ High memory bandwidth 

■ Chip set cost 

■ Low chip count 

■ ECC-protected memory (This is a requirement in 
a server system.) 



During the feasibility stages, we decided to sup- 
port a 128-bit secondary cache data path and not 
offer optional support for a 64-bit cache data path. 
We felt that a system based on the DECchip 21066 
microprocessor, which supports a 64-bit cache 
interface, would meet the cost and performance 
needs in this segment of the market. 5 Keeping in 
mind the importance of time-to-market, we decided 
that the added flexibility in system design alterna- 
tives was not worth the additional design and verifi- 
cation time required to incorporate this feature. 

We decided to provide an option between 64-bit- 
wide memory and 128-bit-wide memory. The wider 
memory data path provides higher memory band- 
width but at an additional cost. The minimum mem- 
ory that the system can support with a 128-bit-wide 
memory data path is double that supported by a 
64-bit memory data path. Memory upgrades are also 
more expensive. For example, with 4-megabyte 
(MB) single in-line memory modules (SIMMs), the 
minimum memory supported by a 64-bit memory 
data path is 8 MB (two SIMMs); with a 128-bit mem- 
ory data path, it is 16 MB. Memory increments with 
a 64-bit data path are 8 MB each, and with a 128-bit
data path are 16 MB each. We decided that the performance
of the 64-bit memory data path was
sufficient for a cost-focused system; however, for 
memory-intensive applications in the server mar- 
ket, 128-bit-wide memory was necessary. 
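
The memory-size arithmetic above can be reproduced with a short calculation. The sketch below assumes that each SIMM supplies 32 data bits, which is what the "two SIMMs" figure for a 64-bit path implies; the function names are hypothetical.

```python
SIMM_DATA_BITS = 32  # assumption: data bits supplied by one SIMM


def simms_per_bank(memory_path_bits: int) -> int:
    """Number of SIMMs needed to fill one memory data path width."""
    if memory_path_bits % SIMM_DATA_BITS != 0:
        raise ValueError("path width must be a multiple of the SIMM width")
    return memory_path_bits // SIMM_DATA_BITS


def memory_increment_mb(memory_path_bits: int, simm_mb: int) -> int:
    """Minimum memory size, and size of each upgrade step, in megabytes."""
    return simms_per_bank(memory_path_bits) * simm_mb


for width in (64, 128):
    print(f"{width}-bit path with 4-MB SIMMs:",
          simms_per_bank(width), "SIMMs,",
          memory_increment_mb(width, 4), "MB per increment")
```

With 4-MB SIMMs this gives 8 MB for the 64-bit path and 16 MB for the 128-bit path, matching the figures in the text.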

One alternative we explored could have pro- 
vided all the features of a cost-focused system in a 
chip set of three chips, using two identical 208-pin 
data path slices and one 240-pin controller that provided
the PCI bridge, cache controller, and memory
controller functions. This configuration, however, 
would have been restricted to 64-bit memory width 
and parity protection on memory. Thus it would 
not have met two of the four desirable features of 
a high-performance system. 

The partitioning that we chose permitted us 
to satisfy the requirements of both cost-focused 
and performance-oriented systems. By splitting 
the design into three unique chips: a data slice, a
cache/memory controller, and a PCI bridge, we met 
the requirements of a cost-focused system with 
the 4-chip configuration. All 4 chips are 208-pin 
packages, costing roughly the same as the 3-chip 
alternative. This partitioning scheme allowed us to 
support a 128-bit-wide data path to memory and 
ECC protection with the addition of 2 data slices 
at relatively low incremental cost. Thus it met the 
requirements of a performance-focused system. We 
could not support ECC with the 64-bit-wide mem- 
ory due to pin-count constraints, but we felt that 
this trade-off was reasonable given that cost was 
more important than ECC-protected memory in this
market. This partitioning scheme had the added
advantage of presenting a single load on the PCI
local bus, as opposed to the two loads presented by
the 3-chip configuration described above. 

Another alternative was to provide a 4-chip configuration
with 128-bit-wide, ECC-protected memory.
This would have required the data slices to be
of higher pin count and therefore higher cost, thus 
penalizing the cost-focused implementation. 

PCI Local Bus Interface

The PCI local bus is a high-performance bus 
intended for use as an interconnect mechanism 
between highly integrated peripheral controller com- 
ponents, peripheral add-in boards, and processor/ 
memory subsystems. Interfacing the DECchip 21064 
family of CPUs to the PCI local bus opens up the 
Alpha AXP architecture to what promises to be an 
industry-standard, plug-and-play interconnect for
PCs. The PCI bridge provides a fully compliant host
interface to the PCI local bus. This section describes 
some features of the PCI bridge. 

The PCI bridge includes a rich set of DMA transac- 
tion buffers that allows it to perform burst transfers 
of up to 64 bytes in length with no wait states 
between transfers. We optimized our design for nat- 
urally aligned bursts of 32 bytes and 64 bytes 
because this would eliminate the need for a large 
address counter and because we discovered through 
research that most PCI devices in development 
would not perform DMA bursts longer than 64 bytes.

DMA Write Buffering We chose a DMA write
buffer size of four cache blocks. This size would
allow for two PCI peripheral devices to alternate 
bursts of 64 bytes each, thus maximizing use of PCI 
bandwidth. We organized the DMA write buffer as 
four cache block entries (four addresses) to simplify 
the cache/memory interface. In addition, this would 
allow the data buffers to be used efficiently when- 
ever 32-byte bursts were in use. 

DMA Read Buffering We designed the DMA read 
buffer to be able to store a fetch cache block and a 
prefetch cache block. As with the DMA write buffer, 
the DMA read buffer is organized to allow for efficient
operation during both 64-byte and 32-byte
bursts. Prefetching is performed only if either the 
initiating PCI command type or a programmable 
enable bit indicates that the prefetch data will likely 
be used. This allows the system designer to combine
32-byte and 64-byte devices without sacrificing
cache/memory bandwidth. To minimize typical
DMA read latency while maintaining a coherent 
view of memory from the PCI, we designed the cap- 
ability for DMA read transactions to bypass DMA 
write transactions, which are queued in the 
DMA write buffer, as long as the DMA read address 
does not conflict with any of the valid DMA write 
addresses. Because most DMA read addresses are 
not expected to conflict, typical DMA read latency 
does not suffer as a result of the relatively deep DMA 
write buffer. 
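
The read-bypass rule described above can be sketched as a small software model. This is not the hardware design: the class and method names are hypothetical, and comparing addresses at 32-byte cache-block granularity is an assumption based on the buffer holding one address per cache block.

```python
CACHE_BLOCK_BYTES = 32     # cache line size quoted elsewhere in this paper
WRITE_BUFFER_ENTRIES = 4   # four cache block entries (four addresses)


def block_address(addr: int) -> int:
    """Round an address down to the start of its cache block."""
    return addr & ~(CACHE_BLOCK_BYTES - 1)


class DmaWriteBuffer:
    """Toy model of a four-entry DMA write buffer."""

    def __init__(self) -> None:
        self.entries = []  # queued (block address, data) pairs, oldest first

    def is_full(self) -> bool:
        return len(self.entries) >= WRITE_BUFFER_ENTRIES

    def queue_write(self, addr: int, data: bytes) -> None:
        if self.is_full():
            raise BufferError("DMA write buffer is full")
        self.entries.append((block_address(addr), data))

    def read_may_bypass(self, addr: int) -> bool:
        """A read may pass queued writes only if no valid entry conflicts."""
        target = block_address(addr)
        return all(block != target for block, _ in self.entries)


buf = DmaWriteBuffer()
buf.queue_write(0x1000, b"\x00" * CACHE_BLOCK_BYTES)
print(buf.read_may_bypass(0x2000))  # no conflict, the read can go first
print(buf.read_may_bypass(0x1010))  # same block as a queued write
```

In this model a conflicting read simply has to wait until the queued writes drain, which keeps the view of memory from the PCI coherent.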

Scatter Gather Address Mapping (S/G Mapping) 
The PCI bridge provides the ability to map virtual 
PCI addresses to physical locations in main memory. 
Because each 8-kilobyte (kB) page can be mapped
to an arbitrary physical page in main memory, 
a virtual address range that spans one or more 
contiguous pages can be mapped to pages that are
physically scattered in main memory, thus the
name S/G mapping. Using this mechanism, software
designers can efficiently manage memory while 
performing multiple-page DMA transfers. 

Although our inclusion of S/G mapping offers 
efficiency benefits to software designers, it also 
presented us with design challenges in the areas 
of performance and cost goals. The PCI bridge per- 
forms address translation by using incoming PCI 
physical addresses to index into a lookup table. 
Each incoming PCI transaction requires the PCI 
bridge to perform an address translation. A simple 
implementation might store the entire lookup table 
in local static random-access memory (RAM). To 
avoid use of this costly component and correspond- 
ing chip set pin allocations, our designers opted to 
store the lookup table in main memory. To mini- 
mize the performance impact of storing the table in 
main memory, the designers incorporated an on- 
chip translation lookaside buffer (TLB) for storing 
the eight most recently used translations. To keep 
things simple, we implemented a circular TLB 
replacement algorithm. 
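
As an illustration of the translation scheme just described, the following Python sketch models an eight-entry TLB with circular replacement in front of a lookup table. The names are hypothetical, a dictionary stands in for the table held in main memory, and the actual table entry format is not modeled.

```python
PAGE_BYTES = 8 * 1024  # 8-kB pages
TLB_ENTRIES = 8        # on-chip TLB holds eight translations


class ScatterGatherMapper:
    """Toy model of S/G address translation with a circular TLB."""

    def __init__(self, page_table: dict) -> None:
        self.page_table = page_table      # virtual page -> physical page
        self.tlb = [None] * TLB_ENTRIES   # (virtual page, physical page)
        self.next_victim = 0              # circular replacement pointer
        self.table_reads = 0              # lookups that went to main memory

    def translate(self, pci_addr: int) -> int:
        vpage, offset = divmod(pci_addr, PAGE_BYTES)
        for entry in self.tlb:
            if entry is not None and entry[0] == vpage:
                return entry[1] * PAGE_BYTES + offset   # TLB hit
        # TLB miss: fetch the translation from the table in main memory
        self.table_reads += 1
        ppage = self.page_table[vpage]
        self.tlb[self.next_victim] = (vpage, ppage)
        self.next_victim = (self.next_victim + 1) % TLB_ENTRIES
        return ppage * PAGE_BYTES + offset


# Two contiguous virtual pages mapped to scattered physical pages.
mapper = ScatterGatherMapper({0: 7, 1: 3})
print(hex(mapper.translate(0x0010)))
print(hex(mapper.translate(PAGE_BYTES + 4)))
print(mapper.table_reads)
```

Circular replacement needs only a single pointer, which is why it is cheaper to build than a true least-recently-used policy.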

PCI Byte Access Support To successfully incor- 
porate Alpha AXP CPUs into PC environments, 
we required collaboration across the corporation. 
Digital engineers defined a software/hardware 
mechanism that allows the 32-bit/64-bit Alpha AXP 
architecture to coexist with components on the PCI 
local bus that require arbitrary byte access granular- 
ity. This mechanism requires that low-order address 
bits be used to encode byte lane validity. Imple- 
menting this mechanism reduces the density of I/O 
registers in the address space and conveys byte lane 
validity information through the address itself. 

I/O write performance in this address space suffers
because each CPU-initiated I/O transaction can
convey only up to 64 bits (a quadword) of data and 
byte lane validity information. To allow for full 
utilization of the DECchip 21064 microprocessor's 
32-byte internal write buffer during I/O writes to 
devices that do not require byte granularity, the 
chip set designers implemented an address range 
that does not perform byte lane decoding. In this 
space, up to 32 bytes can be transferred from the 
CPU and burst onto the PCI in a single transaction. 
This allows for efficient bandwidth utilization dur- 
ing writes to I/O devices that exhibit memory-like 
interfaces, such as video adapters with directly 
accessible frame buffers. 



Guaranteed Access Time Systems that support 
EISA or ISA expansion buses must be able to provide
a guaranteed maximum read latency from EISA/ISA 
peripherals to main memory (2.5 microseconds for 
EISA, 2.1 microseconds for ISA). This requirement 
presented a challenge for us during our design. In 
the worst case, a simple memory read request from
an EISA/ISA peripheral can result in significant 
latency due to our use of deep DMA write buffering 
and S/G mapping. Although our decision to allow 
DMA reads to bypass DMA writes provides systems 
with a typically low latency, this feature does not 
avoid worst-case high latency. To meet the EISA/ISA 
worst-case requirements, we included in our 
design PCI sideband signals and cache/memory arbi- 
tration sequences that allow for guaranteed main 
memory access time. When guaranteed access time 
is required, the EISA/ISA bridge must signal the PCI 
bridge by asserting a PCI sideband signal. In 
response, the PCI bridge will flush its DMA write 
buffers, hold ownership of the cache/memory, and 
signal readiness to the EISA/ISA bridge. When the 
EISA/ISA transaction starts, this sequence guaran- 
tees that the path to main memory is clear and will 
therefore have guaranteed access time. 

Memory Controller 

The memory controller supports up to eight banks 
of dynamic random-access memory (DRAM) and 
one bank of dual-port video random-access mem- 
ory (VRAM). Each memory bank can be selectively 
programmed to enable two subbanks, which allows 
the memory controller to support double-sided 
SIMMs that have two row address strobe (RAS) lines 
per bank. The memory controller thus has the flexi- 
bility to support system memory sizes of 8 MB to 
4 gigabytes (GB) of DRAM and 1 MB to 8 MB of VRAM. 
System designers can choose to implement mem- 
ory by banks of individual DRAMs or SIMMs, either 
on board or across connectors. The memory con- 
troller is able to support a wide range of DRAM sizes 
and speeds across multiple banks in a system, 
by providing separate programmable bank base 
address, configuration, and timing registers on a 
per-bank basis. 

We designed the memory controller for system
flexibility by supporting fully programmable memory
timing with 15-nanosecond (ns) granularity.
This programmability supports SIMM speeds ranging
from 100 ns down to 50 ns. Each memory bank's
timing is programmed through registers that con- 
sist of DRAM timing parameters to control counters. 
Some examples of programmable timing parameters
used to control the memory interface are "row
address setup," "read CAS width," and "CAS pre- 
charge." As the memory controller sequences 
through a memory transaction, these programmed 
counters control the exact timing of RAS, column 
address strobe (CAS), the DRAM address bits, and 
write enables. At the same time, the memory controller
sends commands from the cache/memory
controller chip to the data slice chips to control the 
clock edge for sending and receiving memory data 
on DRAM writes and reads, respectively. 
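
To show what 15-ns granularity means in practice, the sketch below converts a DRAM timing requirement in nanoseconds into a whole number of 15-ns steps. Rounding up so that a minimum time is always met is an assumption, as are the function names and the sample parameter value; the actual register encoding is not described here.

```python
import math

TIMING_STEP_NS = 15  # programmable timing granularity stated in the text


def timing_steps(required_ns: float) -> int:
    """Smallest number of 15-ns steps that covers the required interval."""
    if required_ns < 0:
        raise ValueError("timing requirement must be non-negative")
    return math.ceil(required_ns / TIMING_STEP_NS)


def programmed_ns(required_ns: float) -> int:
    """Interval actually produced once rounded to the 15-ns grid."""
    return timing_steps(required_ns) * TIMING_STEP_NS


# Hypothetical DRAM parameter: a 50-ns minimum needs four steps (60 ns).
print(timing_steps(50), "steps,", programmed_ns(50), "ns")
```
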

One customer is currently using one of the banks 
in combination with medium-scale integration (MSI) 
components to interface to a very slow memory 
bus that supports flash read-only memories (ROMs), 
nonvolatile RAM, and light-emitting diodes (LEDs).
Since the original design was not done with a very 
slow memory interface in mind, this demonstrates 
that the chip set provides flexible, programmable 
timing functionality independent of the system. 

The memory controller allows the system 
designer to build an inexpensive graphics subsys- 
tem using a video frame buffer on the memory data 
bus, and a low-cost video controller on an expan- 
sion bus like the ISA bus. The system designer can 
achieve competitive graphics performance by 
using the processing power of the CPU for graphics 
computations and the existing high-bandwidth 
memory data path for transferring data between 
the graphics computation engine (the CPU) and the 
frame buffer. The interface between the memory 
controller and the video controller is very economical:
only two control signals are required to time
the transfer of screen data from the random-access 
memory of the VRAM to the serial-access memory of 
the VRAM. The video controller is responsible for 
transferring the data from the serial memory of the 
VRAM to the screen. 

Although we designed the memory controller to 
be flexible, we also included features that improved 
performance. Two such features are optimizations to 
reduce memory read latency and selective support 
for use of page mode between memory transactions. 

To minimize memory read latency, the memory 
controller prioritizes reads above writes pending in 
the memory write buffer. For a CPU memory read, 
the memory controller waits six system cycles after 
the last read data before servicing a pending write, 
unless the memory write buffer is full. At least six 
system cycles occur between the time the memory 
controller latches the last read data from the DRAMs
and the time a subsequent read request could be
issued by the DECchip 21064 processor. Because 
memory write transactions take longer than six 
cycles to complete, our choice to delay the execu- 
tion of a pending write allows read latency to be 
reduced for the following read. Waiting six system 
cycles after a read is a significant performance 
improvement for successive reads with cache vic- 
tims because every read is accompanied by a write. 

We also chose to improve performance by selec- 
tively determining which memory transactions 
would benefit most by staying in page mode. The 
memory controller stays in page mode after a DMA 
read burst and between successive memory writes. 
Page mode is not supported between CPU memory 
read transactions since the RAS precharge time can 
typically be hidden between successive CPU read 
requests. 

Cache Controller 

The secondary cache interface logic is partitioned 
across the cache/memory controller chip and the 
data slice chips. The cache/memory controller chip 
contains the address path and control logic, and the 
data slice chips provide buffering for four cache 
lines of data to and from memory. We designed 
the cache controller to be system independent and 
flexible so that it could be designed into a wide 
range of systems. 

The chip set supports a direct-mapped, write-back
secondary cache with a data width of 128 bits
and a cache line fixed at 32 bytes. The chip set 
allows the system designer to choose a secondary 
cache size ranging from 128 kB to 16 MB, as deter- 
mined by software configuration. The speed of 
the cache RAMs must be fast enough to support the 
chip set's read access time of one system cycle.
Writes to the cache can be programmed to take one 
or two system cycles. The write enables can be pro- 
grammed to have a half-cycle or full-cycle pulse 
width when writing the cache during fill cycles. 
This feature was added to give the system designer 
flexibility in meeting SRAM write-enable specifica- 
tions with various system cycle times. 
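
For readers unfamiliar with direct-mapped caches, the following sketch shows how a physical address could be split into tag, index, and offset for the 32-byte line size and the range of cache sizes given above. It is a generic textbook decomposition under the assumption of a power-of-two cache size, not a description of the chip set's internal address path; the function name is hypothetical.

```python
LINE_BYTES = 32  # cache line size is fixed at 32 bytes


def split_address(addr: int, cache_bytes: int) -> tuple:
    """Return (tag, index, offset) for a direct-mapped cache."""
    if cache_bytes < LINE_BYTES or cache_bytes & (cache_bytes - 1):
        raise ValueError("cache size must be a power of two")
    num_lines = cache_bytes // LINE_BYTES
    offset = addr % LINE_BYTES                 # byte within the line
    index = (addr // LINE_BYTES) % num_lines   # which line in the cache
    tag = addr // cache_bytes                  # identifies the memory block
    return tag, index, offset


KB = 1024
# Two addresses one cache size apart land on the same line with
# different tags, so in a direct-mapped cache they evict each other.
print(split_address(0x40, 128 * KB))
print(split_address(128 * KB + 0x40, 128 * KB))
```
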

Another feature added to the cache controller to 
provide flexibility is the support of an optional allo- 
cation policy on CPU writes. The write-back sec- 
ondary cache is always allocated on CPU memory 
read misses. The option to allocate the cache on 
CPU memory write cache misses is programmable 
and can be disabled by software during system ini- 
tialization. We chose to provide this option since 
disabling cache write allocation can allow higher
memory write bandwidth. This feature can be used 
by system designers to determine whether particu- 
lar applications have better performance when sec- 
ondary cache write allocation is disabled. 

The cache controller provides arbitration 
between the CPU and the PCI bridge chip for sec- 
ondary cache ownership. The arbitration policy 
is programmable and varies the level of control 
the PCI bridge has in keeping the ownership of the 
secondary cache during DMA transactions. 

Although we designed the cache controller for 
system flexibility, we also included features that 
would give it performance advantages. One such 
feature is the memory write buffer. The cache con- 
troller uses the memory write buffer to store four 
cache lines of data for cache victims, DMA writes, 
CPU-noncacheable writes, and CPU-cacheable writes 
when write allocate mode is disabled. The buffer is 
organized as first in, first out (FIFO) on cache-line 
boundaries. Successive writes to the same cache 
line are not merged into the buffer because the CPU 
chip write buffer performs this function. The cache 
controller allows CPU and DMA reads to bypass the 
write buffer as long as the read address does not 
conflict with any of the write addresses. The mem- 
ory write buffer improves performance by allowing 
timely acknowledgment of write transactions. Read 
bypassing of the write buffer improves perfor- 
mance by reducing memory read latency. 

Global Design Considerations 
This section briefly discusses some of the decisions 
concerning silicon technology, packaging technol- 
ogy, and internal clocking of the chip sets. 

Silicon Technology The design team chose to use
an externally supplied gate-array process that 
offered quick time-to-market and low cost. Most 
chips designed in the Semiconductor Engineering 
Group are manufactured using Digital's proprietary 
complementary metal-oxide semiconductor (CMOS) 
processes, which emphasize high speed and high 
integration. Our chips' performance and complexity
(30-ns cycle time, approximately 35,000 gates
per chip) did not require these capabilities.
Gate-array technology offered shorter design times and
quicker turnaround times than Digital's custom sili- 
con technology. 

Packaging Technology When choosing a package,
the design team considered issues of package
and system cost, design partitioning, and heat
produced by power dissipation. Some of these
issues are discussed in the Partitioning Alternatives 
section. 

We chose to put all three chips in 208-pin plastic 
quad flat packages (PQFPs). The 208-pin PQFP is one 
of the most popular low-cost, medium pin-count, 
surface-mount packages. One drawback of PQFPs, 
however, is their low limit on power dissipation. 
To ensure a junction temperature of 85 degrees 
Celsius with 100 linear feet per minute of airflow, 
the power dissipation must be limited to 1.5 watts 
(W). The power dissipation of the data slice is 
about 1.7 W, resulting in a junction temperature
approaching 100 degrees Celsius. We verified that
reliability was not an issue at a junction temperature of 100
degrees Celsius. However, we had to ensure that the 
chip timing worked at a junction temperature of 
100 degrees Celsius, as opposed to the 85 degrees 
Celsius we would normally use. We could not use 
this approach on the PCI bridge chip because the 
additional timing optimization required would 
have adversely affected the schedule. We had to 
take special measures in the design to keep the 
power dissipation within the 1.5-W limit. We imple- 
mented conditional clock nets for large blocks of 
registers that are loaded infrequently, such as the 
CSRs and the TLB. 

Internal Clocking To achieve the shortest possi- 
ble cross chip set latencies, we implemented a four- 
phase clock system. A four-phase system allows 
data to be transferred from one section of the chip 
set to another in less than a full clock cycle if logic 
delays are sufficiently small. 

In contrast to approaches based on latch designs, 
which can offer lower gate-count implementations, 
we chose to use mostly edge-triggered devices. 
We viewed this as an opportunity to simplify the 
design analysis and timing verification process by 
keeping the number of timing reference points to 
four clock edges. 

To further simplify the clocking system, the 
designers chose to make the PCI clock and the cache/ 
memory clock synchronous to each other. This 
approach avoids the need for synchronizers (and 
corresponding synchronizer delays) between clock 
domains; it also reduces the number of clock speed 
combinations to be verified. Although the syn- 
chronous approach does not allow the system 
designer to decouple the PCI clock speed from the 
cache/memory clock speed, we felt that the added 
complexity and verification effort required to sup- 
port asynchronous clocks would not be worth the 
small degree of flexibility that would be gained
from such a design. 

Functional Verification 

Given the short design schedule and the require- 
ment that first-pass prototypes be highly functional 
for customers, the team adopted a strategy of 
pseudorandom testing at the architectural level 
of the chip set as a whole. We felt that this strategy
would test more of the design more quickly
and would find more subtle and complex bugs than 
a testing methodology focused on the gate/register 
level of each separate chip. 

The DECSIM simulation environment included 
models for the three chips, a DECchip 21064 bus 
functional model (BFM), a PCI BFM, a cache model, 
a memory model, and some "demon" models that 
could be programmed to pseudorandomly gener- 
ate events such as the assertion of the video port 
inputs or the injection of errors. We developed 
SEGUE templates and used them in a variety of exercisers
to generate DECSIM scripts pseudorandomly. 6

To keep the testing environment from being 
overly complicated, we allowed users to pseudoran- 
domly configure only those aspects of the design 
that significantly altered the operation of the con- 
trol logic. Many configurable aspects of the chip set 
and simulation environment (for example, the PCI 
S/G map) were not varied in the exercisers and 
were tested with simple focused tests. 

In addition to programming BFMs to read back and 
check data, we built a variety of checkers into the 
simulation environment to verify correct operation 
of RAM control timing, PCI protocol, tristate bus control,
PCI transaction generation, data cache invalidate
control on the DECchip 21064 CPU, and many
other functions. At the end of every exerciser run, 
the secondary cache and memory were checked for 
coherence and correct error protection. 

The verification efforts of the team resulted in 
the removal of over 200 functional bugs, ranging 
from simple bugs to quite complex and subtle bugs, 
prior to the fabrication of first-pass prototypes. We 
found no "show stopper" bugs in the core func- 
tions required for first-pass prototype chips, and 
we used simple work-arounds for the few bugs that 
we did find in the first-pass design. 

The EB64+ Evaluation Kit 

This section of the paper discusses the develop- 
ment of the EB64+ evaluation kit. After presenting 
the project's goals and the overview of the kit, it
discusses some of the module design issues that
were addressed during the design of the EB64+ 
module. This section concludes with performance 
results of benchmarks run on the EB64+ system. 

Project Goals 

The first and most important goal of the EB64+ eval- 
uation kit project was to provide a sample design 
for customers using the DECchip 21064 micropro- 
cessor and the DECchip 21071 and the DECchip 
21072 chip sets. Another major goal was to provide 
an evaluation and development platform that used 
standard PC components. These two goals would 
enable a customer to evaluate their design trade- 
offs quickly and to complete their system design 
faster and with a better chance of success. 

Secondary goals were to provide a development 
and debug environment for the core chip set and to 
provide a high-performance benchmarking system 
for the microprocessor and core chip set. The 
EB64+ kit also serves as a platform for hardware and 
software development for PCI I/O devices. 

EB64+ Kit Overview 

Figure 2 shows a block diagram of the EB64+ mod- 
ule, a full-size PC (12.0 inch by 13.0 inch) mother 
board. The major components on the module are 
given below: 

■ DECchip 21064 microprocessor (150 megahertz
[MHz] to 275 MHz)

■ Secondary cache (512 kB, 1 MB, or 2 MB) 

■ Secondary cache address buffer 

■ Interrupt/configuration programmable array 
logic (PAL) device 

■ Serial ROM interface for the microprocessor

■ System clock generator: oscillator, phase-locked 
loop (PLL), clock buffers 

■ Core logic chip set 

■ Two secondary cache control PALs 

■ PCI bus peripherals: embedded small computer 
system interface (SCSI) and Ethernet 

■ PCI bus arbiter 

■ Intel 82378IB bridge between the PCI and ISA
buses

■ Three ISA expansion slots 

■ Eight slots of standard 36-bit memory SIMMs

■ Memory control signal buffers 
Figure 2 Block Diagram of the EB64+ Module



Secondary Cache Size and Speed 
The DECchip 21064 processor has programmable 
secondary cache read and write access times with a 
granularity equal to the processor clock cycle time. 
For instance, if the read access time is 25 ns, the pro- 
grammed value for a 150-MHz processor (6.6-ns cycle 
time) would be 4, and the programmed value for a 
200-MHz processor (5-ns cycle time) would be 5.
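
The programmed values in this example follow from rounding the access time up to a whole number of processor cycles. A minimal sketch of that calculation, with a hypothetical function name, is shown below.

```python
import math


def programmed_cycles(access_ns: float, cpu_mhz: float) -> int:
    """Processor cycles needed to cover a cache access time.

    access_ns * cpu_mhz / 1000 is the access time expressed in
    processor cycles; it is rounded up to the next whole cycle.
    """
    return math.ceil(access_ns * cpu_mhz / 1000)


for mhz in (150, 200):
    print(f"{mhz} MHz, 25-ns access:", programmed_cycles(25, mhz), "cycles")
```

For a 25-ns access time this yields 4 cycles at 150 MHz and 5 cycles at 200 MHz, the values given in the text.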

One of the more difficult decisions for any sys- 
tem designer is to determine the optimal cache size 
and speed in terms of cost and performance. The 
EB64+ module supports various cache size and 
speed options in order to allow a system designer 
to evaluate the difference between a large, slow 
cache and a small, fast cache. The trade-off here is 
usually between lower cost for the 512-kB cache 
and higher performance for the 2-MB cache. The 
2-MB cache uses four 128K by 9 SRAMs and twelve 
128K by 8 SRAMs for the data store, and the 512-kB
cache uses four 32K by 9 SRAMs and twelve 32K by
8 SRAMs. 
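
The SRAM counts above can be checked against the cache sizes with a short calculation. The sketch assumes that 128 of the SRAM bits at each address hold data and that the remaining bits hold check information; how bits are assigned to individual parts is not stated in the text, and the function name is hypothetical.

```python
DATA_BITS = 128  # width of the secondary cache data path


def cache_data_store(depth_words: int, x9_parts: int = 4, x8_parts: int = 12):
    """Return (data bytes, extra bits per word) for a bank of SRAMs."""
    total_bits = x9_parts * 9 + x8_parts * 8   # bits available per address
    extra_bits = total_bits - DATA_BITS        # bits left over for checking
    data_bytes = depth_words * DATA_BITS // 8
    return data_bytes, extra_bits


K = 1024
for depth in (128 * K, 32 * K):
    size, extra = cache_data_store(depth)
    print(f"{depth // K}K-deep parts: {size // K} kB of data,",
          extra, "extra bits per 128-bit word")
```

The four extra bits per 128-bit word are consistent with one parity bit for every 32 data bits.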

We decided to share data RAM footprints between 
the 32K by 8 SRAMs and the 128K by 8 SRAMs, thus 
allowing the system designer to build two different 
modules: one with a 512-kB cache and the other 
with a 2-MB cache. The designer can evaluate the 
speed-to-size trade-off by using faster SRAMs for 
the smaller cache and slower SRAMs for the larger 
cache. The system designer can choose to evaluate 
the effect of varying the cache size from 512 kB, to
1 MB, to 2 MB, without varying the cache speed,
by configuring jumpers to disable portions of the
2-MB cache on an EB64+ module.

System Clocking Design 

System clocking for the EB64+ module presented a 
challenge in two different areas. The first area was 
the high-frequency input clocks needed by the 



Digital Technical Journal Vol. 6 No. 2 Spring 1994






DECchip 21071/21072 PCI Chip Sets 



DECchip 21064 microprocessor. The input clocks 
operate at twice the frequency of the DECchip 
21064 CPU, requiring a 300- to 550-MHz oscillator 
for the EB64+ module. Initially, an emitter-coupled 
logic (ECL) output oscillator was used for this pur- 
pose. The main drawback to this solution was the 
cost, which is in the $40 to $50 range. The other dis- 
advantage was the long lead time and nonrecurring 
engineering charges associated with unique oscilla- 
tor frequencies. 

By working closely with a vendor of gallium 
arsenide (GaAs) devices, we were able to provide an 
alternative in the $8 to $18 range. The device con- 
sists of a low-frequency oscillator and a PLL that mul- 
tiplies the low-frequency oscillator to provide the 
high-frequency input that the processor requires. 
For example, a 300-MHz frequency clock is gener- 
ated using a 30-MHz oscillator connected to a PLL 
that multiplies this by 10 to provide the 300-MHz 
input. Since lower frequency oscillators are pro- 
duced by more vendors, the lead times and nonre- 
curring engineering charges for unique frequencies 
are minimal when compared to the ECL output 
oscillators. 

Generating the clocks for the other system com- 
ponents was quite challenging. The core logic chip 
set, PCI devices, and the cache control PALs together 
require three types of clock signals: the first clock 
is in phase with the processor's sysClkOut clock sig- 
nal; another clock is 90 degrees phase shifted from 
the first; and a third clock has twice the frequency 
of and is in phase with the first. The frequency of
sysClkOut is the processor's internal clock frequency
divided by an integer between 2 and 17. Some
divisors may result in a sysClkOut duty cycle that is 
not 50 percent. A PLL is used to generate both the 
phase-shifted and the double-frequency clock. It 
also guarantees a 50 percent duty cycle, which is 
required for the PCI clock. 
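
The clock relationships described in this section can be tied together in a short calculation. The sketch below is illustrative only: the function and key names are hypothetical, and the divisor of 5 in the usage example is an arbitrary choice within the permitted range rather than a setting taken from the EB64+ module.

```python
from fractions import Fraction


def system_clocks(oscillator_mhz, pll_multiplier: int, sysclk_divisor: int) -> dict:
    """Derive the clock frequencies (in MHz) described in the text."""
    if not 2 <= sysclk_divisor <= 17:
        raise ValueError("sysClkOut divisor must be between 2 and 17")
    # The first PLL multiplies the low-frequency oscillator.
    input_clock = Fraction(oscillator_mhz) * pll_multiplier
    # The processor input clocks run at twice the CPU frequency.
    cpu_clock = input_clock / 2
    # sysClkOut is the CPU clock divided by an integer.
    sysclk = cpu_clock / sysclk_divisor
    return {
        "input": input_clock,
        "cpu": cpu_clock,
        "sysclk": sysclk,          # also the 90-degree shifted clock
        "sysclk_2x": sysclk * 2,   # double-frequency system clock
    }


clocks = system_clocks(oscillator_mhz=30, pll_multiplier=10, sysclk_divisor=5)
for name, mhz in clocks.items():
    print(name, float(mhz), "MHz")
```
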

Figure 3 illustrates how the EB64+ module generates
the three system clocks from the processor's
sysClkOut signal. In addition to the PLL, the system 
clock generator uses low-skew clock buffers to 
drive the final device loads of the system. One output
of the clock buffers is used to provide the feedback
to the PLL. This causes the overall delay from
sysClkOut to the system clock to be close to zero. 

Design Evolution 

As noted previously, the EB64+ kit was developed 
to provide an example design to external customers 
as well as provide a debug and development plat- 
form for the core logic chip set. The focus of the 
design evolved during the project. 

Initially, we included several features on the 
EB64+ module to support the various modes of 
the chip set. As the design progressed, an updated 
version of the EB64+ module was developed. The 
final version focused more on being a sample 
design than a debug and development platform for 
the chip set. Some of the features that fell into this 
category are listed below. 

■ Initially, the EB64+ module supported both the 
64-bit and 128-bit memory on the same module 
with configuration jumpers. This design affected 
performance because 64 bits of the cache data 
bus were routed to two data slice chips. The 
final version of the EB64+ module supports only 
128-bit memory. This change allowed us to 
reduce the cache read access time on the 
DECchip 21064 processor by 3 ns, thus reducing 
the programmed 2-MB cache read access time 
for a 200-MHz DECchip 21064 processor from 
7 cycles to 6 cycles. 

■ Certain modes of the chip set were controlled by 
configuration jumpers initially. These have been 
redefined to support additional cache sizes and 
speeds to support a wider range of evaluation 
and benchmarking. 

Performance 

Figures 4 and 5 show the results of the BYTE maga- 
zine portable CPU/floating-point unit (FPU) bench- 
marks run on an EB64+ system running the 
Windows NT operating system. The EB64+ system 
has a 128-bit memory subsystem with 70-ns (RAS 
access time) DRAMs. The 150-MHz, 166-MHz, and 
200-MHz benchmarks were run using a DECchip 
21064 microprocessor with a 512-kB cache with a 
28-ns read access time. The 275-MHz benchmark 
was run on a DECchip 21064A microprocessor 
with a 2-MB cache with a 35-ns read access time. 
The benchmarks for the DECchip 21066 processor 
were run on an EB66 system with a 256-kB cache. 
The figures show the performance relative to other 
Windows NT systems available in the market today. 
The benchmark data for the Intel486 DX2-66 and 
Pentium 60-MHz chips and for the MIPS Computer 
Systems' R4400SC processors was taken from 
the Alpha AXP Personal Computer Performance 
Brief— Windows NT? 

Table 1 compares the bandwidths on an EB64+ 
system using the two possible chip set configura- 
tions, a 200-MHz processor, and 70-ns DRAMs. 

Summary 

The DECchip 21071 and the DECchip 21072 chip sets 
and the EB64+ evaluation kit met their project goals 
by helping to proliferate the Alpha AXP architecture 
in the PC market. Several customers, as well as 
some groups within Digital, use the chip sets in 
their systems today. Many of these customers and 
internal groups have used the EB64+ platform as a 
basis for their designs and as a means of initiating 
Figure 5 EB64+ System Performance Benchmarks 



Table 1 Comparison between a 64-bit Memory Data Path and a 128-bit Memory Data Path 
Analysis of Data Compression 
in the DLT2000 Tape Drive 

The DLT2000 magnetic tape drive is a state-of-the-art storage product with a 
1.25M-byte-per-second data throughput rate and a 10G-byte capacity, without 
data compression. To increase data capacity and throughput rates, the DLT2000 
implements a variant of the Lempel-Ziv (LZ) data compression algorithm. An LZ 
method was chosen over other methods, specifically over the Improved Data 
Recording Capability (IDRC) algorithm, after performance studies showed that the 
LZ implementation has superior data throughput rates for typical data, as well as 
superior capacity. This paper outlines the two designs, presents the methodology 
and the results of the performance testing, and analyzes why the LZ implementa- 
tion is faster, when the IDRC hardware implementation had twice the bandwidth 
and was expected to have faster throughput rates. 



Overview 

Data compression, a method of reducing data size 
by coding to take advantage of data redundancy, is 
now featured in most tape drive products. Two com- 
pression techniques in widespread use are (1) an 
arithmetic coding algorithm called Improved Data 
Recording Capability (IDRC) and (2) variants of the 
general Lempel-Ziv (LZ) compression algorithm. 
Current tape products that implement these 
algorithms include IBM's family of half-inch, 
36-track tape products, which are fast (a maximum 
throughput rate of approximately 3M bytes per 
second [M bytes/s]) and relatively expensive 
(originally about $60K) and have employed the IDRC 
algorithm for about five 
years. More recently, the 8-millimeter (mm) helical 
scan tape products began incorporating IDRC data 
compression. Also, some 4-mm helical scan digital 
audiotape (DAT) products now use a variant of the 
LZ algorithm, as do some quarter-inch cartridge 
(QIC) tape products. 

In developing a complex product like an industry- 
leading tape drive, it is difficult to determine at the 
beginning of the project the design that will have 
the best performance characteristics and meet 
time/cost goals. When Digital included data com- 
pression in the plans for its DLT2000 tape product, 
the choice was not clear regarding which compres- 
sion technology would best enhance the tape drive's 
data transfer rate and capacity. Keeping within cost 
constraints and incurring an acceptable level of risk 
to the development schedule were important fac- 
tors as well. The options were greatly limited, how- 
ever, because the schedule was too short for the 
engineering team to implement a compression 
method on a silicon chip designed specifically for 
the DLT2000 tape drive; therefore, the team needed 
to find a compression chip that was available 
already or would be soon. 

Another important consideration was that the 
compression method used on the DLT2000 tape 
drive would likely be used on future digital linear 
tape (DLT) products. For media interchangeability, 
such products would have to be able to write and 
read media compatible with the DLT2000 tape drive. 
New products that used different compression 
methods would require extra hardware to handle 
both types of data compression. Since extra hard- 
ware adds significant cost and complexity to prod- 
ucts, the use of different compression methods is 
undesirable. Also, to meet future data throughput 
needs, the compression method used on the 
DLT2000 tape drive had to support the significantly 
higher data transfer speeds planned. If the compres- 
sion chip used initially was too slow for future prod- 
ucts, it had to be at least possible to develop an 
implementation of the same compression algorithm 
that would be fast enough for future DLT products. 

To gain more expertise in applying data compres- 
sion technology to tape drives, the tape develop- 
ment group investigated several designs using 
various data compression chips. Eventually, we 
created about 20 DLT2000 engineering prototype 
units, each of which used one of the two most com- 
mon data compression methods: IDRC and an LZ 
variant. The specific Lempel-Ziv variant used was 
designated Digital Lempel-Ziv 1 (DLZ1). 1,2 We tested 
the performance of the prototype units and studied 
the results to check for consistency with our expec- 
tations. Such analysis was important since tape 
drive performance with data compression was a 
new area for the engineering team, and the inter- 
play of higher tape transfer rates, new gate arrays, 
compression chip, memory buffers, new firmware, 
and host tape applications is complex. 

Figure 1 shows the basic design of the data path 
on the DLT2000 tape drive's electronics module. 
(Microprocessors, most gate arrays, firmware read- 
only memories [ROMs], and other electronic com- 
ponents are not shown.) Note that the data cache 
size is effectively increased because it contains com- 
pressed data. The data processing throughput of the 
compression chip, however, can potentially be a 
bottleneck between the cache and the small compu- 
ter systems interface (SCSI) bus. The IDRC compres- 
sion chip can process data at throughput rates of up 
to 5M bytes/s, whereas the DLZ1 chip can process 
data at rates of up to about 2.5M bytes/s when com- 
pressing data and up to about 3M bytes/s when 
decompressing data. In each design, the memory 
and data paths outside the compression chip were 
designed to be adequate for the compression chip 
used. 

One major goal of this study was to quantify the 
performance of each implementation to determine 
if the lower throughput of the DLZ1 chip was a prac- 
tical disadvantage in the DLT2000 product. The 
IDRC version of the DLT2000 product, with its maxi- 
mum throughput rate of 5M bytes/s, would seem to 
have a clear throughput advantage, but the typical 
compression ratio and the data rate to the tape 
media are significant factors in the overall through- 
put of the tape drive. 

The development group expected the IDRC and 
DLZ1 chips to have approximately the same com- 
pression ratio (i.e., the result of dividing the number 
of units of data input by the number of units of data 
output). The DLZ1 ratio would possibly be slightly 
higher. The group based their expectation on com- 
parisons of results from several studies. 2,3,4 These 
studies reported compression ratios for various 
types of data on implementations that used either 
the IDRC algorithm or an LZ algorithm but not both. 

Compressing data within the tape drive has a 
multiplying effect on the drive's throughput rate, as 
seen by a host computer. If the uncompressed data 
throughput rate to the tape media is 1.25M bytes/s 
and the data compression ratio is 2.0:1 (or 2.0), the 
expected average data transfer rate is 1.25 X 2.0 = 
2.5M bytes/s. Since the development group thought 
that the typical compression ratio of each imple- 
mentation was 2.0:1, and because the DLZ1 chip 
would tend to become a bottleneck as data rates 
approached the chip's maximum throughput rate, 
the group expected the IDRC prototype to be at least 
as fast as the DLZ1 prototype for a given data set. 
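
The expectation described above amounts to a simple model: the host-visible rate is the native tape rate multiplied by the compression ratio, limited by the compression chip's maximum throughput. The sketch below is only an illustration of that reasoning. The function name is hypothetical, and the model ignores SCSI and host overhead.

```python
def expected_host_rate(native_rate, compression_ratio, chip_limit):
    """Host-visible rate in M bytes/s: tape rate times ratio, capped by the chip."""
    return min(native_rate * compression_ratio, chip_limit)

NATIVE_RATE = 1.25   # M bytes/s to the tape media, without compression
DLZ1_WRITE_LIMIT = 2.5   # approximate DLZ1 chip limit when compressing
IDRC_LIMIT = 5.0         # IDRC chip limit

# At the assumed typical ratio of 2.0, both designs reach the same rate.
print(expected_host_rate(NATIVE_RATE, 2.0, DLZ1_WRITE_LIMIT))  # 2.5
print(expected_host_rate(NATIVE_RATE, 2.0, IDRC_LIMIT))        # 2.5

# At a higher ratio (3.0 is a hypothetical value), only the DLZ1 chip
# limit comes into play.
print(expected_host_rate(NATIVE_RATE, 3.0, DLZ1_WRITE_LIMIT))  # 2.5
print(expected_host_rate(NATIVE_RATE, 3.0, IDRC_LIMIT))        # 3.75
```

Under this model, the IDRC design can never be slower than the DLZ1 design as long as both achieve the same compression ratio, which is why the ratio itself becomes the deciding factor.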

Testing showed, however, that the DLZ1 DLT2000 
prototype consistently, and significantly, surpassed 
the IDRC prototype in both metrics! To ensure the 
correctness of the IDRC implementation used on 
the prototype DLT2000 and thus confirm the unex- 
pected result, the group verified the IDRC compres- 
sion efficiency results by testing two other tape 
products that use the IDRC algorithm. Given identi- 
cal data sets, the benchmark test results were con- 
sistent with those of the IDRC DLT2000 prototype. 

The marked difference between the DLZ1 and 
IDRC prototypes can be mainly attributed to the dif- 
ferences in the compression efficiencies of the two 
algorithms. Relatively low compression ratios on 
the IDRC unit limit its throughput capabilities. The 
author believes that the discrepancy between the 
results of the DLT2000 prototype testing and 
the results of the earlier studies can be explained 
by two factors: variations in the data sets used and 
differences in media format. 

First, the compression efficiency for different 
samples of data, even if of the same type, e.g., 
PostScript data, can vary widely. The data sets 
tested on the DLT2000 prototypes were not identi- 
cal to those tested in the earlier studies. 

Second, some tape drive implementations com- 
bine IDRC data compression with a feature IBM calls 
autoblocking (also known as superblocking). This 
coupling occurs when the tape drive has a media 
format that contains interrecord gaps (IRGs) whose 
number is inversely proportional to the tape block 
(record) size used (sometimes linear). Autoblock- 
ing minimizes the number of IRGs by automatically 
using a large, fixed on-tape block size (e.g., 64K 
bytes). The autoblocking feature packs multiple 
compressed blocks from the host into the larger 
blocks on the media. 4 Reducing the number of IRGs 
on such tape formats is important because IRGs are 
wasted space. If block sizes are small, the number 
of IRGs will be large and the tape capacity sig- 
nificantly reduced. Tape products that combine 
autoblocking with IDRC compression derive an 
increased capacity from both techniques. 

These two factors, however, were not relevant to 
the test results of our study, i.e., the favorable DLZ1 
findings. We performed the DLT2000 prototype 
testing with tape drives that were virtually identical 
except for the compression technology used. Also, 
the data samples, tools, and test environments were 
the same. 

From the test results and analysis we concluded 
that, when compared with the IDRC implementa- 
tion, the DLZ1 implementation combines consis- 
tently superior cartridge capacity (25G bytes at a 
compression ratio of 2.5:1) and superior data 
throughput for most types of real data. The testing 
did not reveal any real data types that compressed 
better with the IDRC technique than with the DLZ1 
technique. In addition, the DLZ1 technique is 
supported by the strong prospect of future DLZ1 
compression chips that will greatly increase the 
maximum data throughput rates. This addresses 
the concern that the DLZ1 technique should sup- 
port a growth path in data throughput rate for 
future members of the DLT product family. 

The remainder of this paper outlines the opera- 
tion of the IDRC and DLZ1 compression techniques, 
discusses what testing was done and how, presents 
the test data, and gives an analysis of the results. 

Description of the IDRC and DLZ1 
Compression Algorithms 

This section provides some historical/industrial 
background on the IDRC and DLZ1 algorithms and 
some cursory information on how they work. 
An in-depth technical presentation of these (or 
other) compression techniques is beyond the scope 
of this paper. For more details on their operation 
and mathematics, please refer to the references. 

The IDRC Compression Algorithm 
IBM developed the IDRC algorithm and employs this 
technique on some members of the Model 3480 and 
Model 3490 tape subsystems. EXABYTE Corporation 
is currently licensing the IDRC algorithm from IBM. -1 

The IDRC algorithm is a lossless, adaptive arith- 
metic compression technique. Arithmetic com- 
pression encodes data by creating an output string 
that represents a sequence of fractional numbers 
between 0 and 1. Each fraction is the result of the 
product of the probabilities of the preceding input 
symbols. 1 5 (>J 
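
The interval-narrowing idea behind arithmetic coding can be illustrated with a short sketch. This is not the IDRC algorithm: it uses a fixed two-symbol probability model chosen only for the example, whereas IDRC adapts its probabilities to the data. The function name and the probabilities are assumptions.

```python
from fractions import Fraction

def narrow_interval(message, probabilities):
    """Return the [low, high) interval between 0 and 1 that identifies `message`."""
    # Assign each symbol a sub-range of [0, 1) proportional to its probability.
    ranges, start = {}, Fraction(0)
    for symbol, p in probabilities.items():
        ranges[symbol] = (start, start + p)
        start += p
    low, width = Fraction(0), Fraction(1)
    for symbol in message:
        sym_low, sym_high = ranges[symbol]
        low += width * sym_low            # move to the symbol's sub-range
        width *= sym_high - sym_low       # width shrinks by the symbol's probability
    return low, low + width

# Hypothetical static model: "a" is three times as likely as "b".
model = {"a": Fraction(3, 4), "b": Fraction(1, 4)}
low, high = narrow_interval("aab", model)
print(low, high, high - low)   # 27/64 9/16 9/64
```

The final width, 9/64, equals the product of the probabilities of the three input symbols. Any fraction inside the interval identifies the message, so likely messages need fewer bits to write down.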

The IDRC technique has two modes: byte oriented 
and binary (bit) oriented. On input, bytes are com- 
pared with the last byte processed. If three or more 
consecutive bytes are found to be equal, processing 
occurs on a byte-by-byte basis. Otherwise, the data 
is compressed bit by bit. 6 

Parallel recording implementations for which 
the number of IRGs is a capacity issue (for example, 
the IBM Model 3490 product) usually combine IDRC 
compression with autoblocking. Since autoblocking 
reduces the number of IRGs (assuming that a smaller 
block size is commonly used), the effective increase 
in tape capacity due to autoblocking surpasses the 
increase that compression alone would yield. 

In some tape implementations, though, data is 
packed into fixed-size blocks on the media whether 
or not compression is used. If done efficiently, this 
packing makes tape capacity on such products 
independent of block size. 

The DLZ1 Compression Algorithm 
A number of variations of the Lempel-Ziv algorithm 
(also referred to as the Ziv-Lempel algorithm) 
have been implemented and are in wide use in the 
industry today. Some examples are the common PC 
compression software tools PKARC, PKZIP, and 
ZOO; the compression method built into the 
MS-DOS Version 6.0 system; and Hewlett-Packard's 
HP 7980XC tape drive. IBM recently announced that 
it has developed a high-speed (40M bytes/s) com- 
pression chip that uses the LZ algorithm. In addi- 
tion, STAC Electronics' data compression products 
and the QIC-122 data compression standard use 
derivatives of the LZ algorithm. 4,5 

Lempel-Ziv methods generally replace redundant 
strings in the input data with shorter symbols. The 
methods are lossless and adapt to the input data. 
Implementations typically simplify the general 
algorithm in one or more ways for practical reasons, 
such as speed and memory requirements for string 
storage. 1 • rU ^ H 

The LZ variant used in the DLZ1 implementation 
maps variable-length strings in the input to variable- 
length output symbols. During compression, the 
algorithm builds a dictionary of strings, which is 
accessed by means of a hash table. Compression 
occurs when input data matches a string in the table 
and is replaced with the corresponding dictionary 
symbol. The dictionary itself is not output to the 
tape media but is rebuilt during decompression. 1 
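
The dictionary mechanism can be sketched with the textbook LZW algorithm, a related LZ variant. This sketch is not the DLZ1 implementation: it uses an unbounded dictionary and emits plain integer codes, and the function names are hypothetical. Python's built-in dict plays the role of the hash table.

```python
def lzw_compress(data):
    """Replace strings already seen with dictionary codes (textbook LZW)."""
    dictionary = {bytes([i]): i for i in range(256)}   # all single bytes
    current, codes = b"", []
    for value in data:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate                        # keep extending the match
        else:
            codes.append(dictionary[current])          # emit the longest match
            dictionary[candidate] = len(dictionary)    # learn the new string
            current = bytes([value])
    if current:
        codes.append(dictionary[current])
    return codes

def lzw_decompress(codes):
    """Rebuild the same dictionary from the code stream alone."""
    if not codes:
        return b""
    dictionary = {i: bytes([i]) for i in range(256)}
    previous = dictionary[codes[0]]
    output = bytearray(previous)
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == len(dictionary):
            entry = previous + previous[:1]            # string defined by this very code
        else:
            raise ValueError("invalid code in stream")
        output += entry
        dictionary[len(dictionary)] = previous + entry[:1]
        previous = entry
    return bytes(output)

sample = b"ABCDEFGHIJKLMNOPQRSTUVWX" * 50   # 24 unique characters, repeated
codes = lzw_compress(sample)
print(len(sample), "bytes in,", len(codes), "codes out")
print(lzw_decompress(codes) == sample)
```

The decompressor never receives the dictionary. It reconstructs each entry one step behind the compressor, which is why the dictionary does not need to be written to the media.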

When the dictionary fills up with strings, the algo- 
rithm cannot adapt to new patterns in the data. For 
this reason, the dictionary needs to be reset period- 
ically. The DLT2000 DLZ1 algorithm resets the dic- 
tionary on each logical block boundary. Thus, the 
compression efficiency can vary according to the 
block size, as well as with the actual data. With 
small blocks, the dictionary is typically still adapt- 
ing to the input data when the block ends and the 
dictionary is reset. This tends to keep the compres- 
sion algorithm from reaching full efficiency. For 
example, with an LZ variant similar to the DLZ1, the 
LZW algorithm presented in Welch's "A Technique 
for High-Performance Data Compression," com- 
pression efficiency increases rapidly as the block 
size used goes from 1 byte to about 8K bytes. 3 The 
efficiency peaks at about 12K bytes, and larger 
block sizes show good but gradually decreasing 
compression efficiencies. The initial input block 
range that exhibits rapid improvement in compres- 
sion efficiency (1 byte to 8K bytes, in this case) is 
referred to as the "adaptation zone." 
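
The effect of resetting the dictionary at every block boundary can be demonstrated with a small experiment. The sketch below is an approximation under stated assumptions: it uses textbook LZW rather than DLZ1, charges each code roughly the number of bits needed to address the current dictionary (with a 9-bit minimum), and runs on synthetic record-like data. Because the dictionary is unbounded, it does not reproduce the gradual decline reported for large blocks.

```python
def lzw_bits(block):
    """Approximate bits emitted by LZW when the dictionary starts fresh for this block."""
    dictionary = {bytes([i]) for i in range(256)}   # a set suffices for counting
    current, bits = b"", 0
    for value in block:
        candidate = current + bytes([value])
        if candidate in dictionary:
            current = candidate
        else:
            bits += max(9, len(dictionary).bit_length())   # cost of one code
            dictionary.add(candidate)
            current = bytes([value])
    if current:
        bits += max(9, len(dictionary).bit_length())
    return bits

def ratio_with_resets(data, block_size):
    """Compression ratio when the dictionary is reset on every block boundary."""
    total_bits = sum(lzw_bits(data[i:i + block_size])
                     for i in range(0, len(data), block_size))
    return (8 * len(data)) / total_bits

# Synthetic sample data (an assumption, not one of the paper's data sets).
sample = b" ".join(b"record %05d: the quick brown fox jumps over the lazy dog" % i
                   for i in range(2000))
for block_size in (64, 512, 4096, 32768):
    print(block_size, round(ratio_with_resets(sample, block_size), 2))
```

Small blocks end before the dictionary has learned the recurring strings, so the ratio stays low. Larger blocks give the dictionary time to adapt.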

Test Procedures 

The development group carried out three main sets 
of tests. 

1. Tests that measured the compression efficiency 
on an OpenVMS system and on an ULTRIX system, 
which is based on the UNIX system 



2. Tests that measured the compression efficiency 
and the data throughput in a high-throughput 
test system environment 

3. Benchmark tests that measured the IDRC com- 
pression ratios on two other tape products 

The DLT2000 firmware measured the compres- 
sion ratios precisely by comparing the block size 
(in bytes) before and after compression, during 
write command processing. In the benchmark 
tests, compression ratios were calculated from 
total tape capacities with and without compression 
enabled. We repeated the DLT2000 tests with minor 
variations in test parameters; the results suggested 
an uncertainty of approximately ±1 percent in the 
measurements. 

Test configurations were identical in system 
type, test software, and operating system versions. 
We often used the same test bed and varied only the 
tape unit under test, i.e., the DLZ1 or the IDRC. The 
hardware and firmware on the different DLT2000 
prototypes were identical to ensure that factors 
such as diagnostic code overhead and clock speed 
did not skew test results between the DLZ1 and the 
IDRC units, or between test runs. We also varied 
some parameters and repeated tests to ensure that 
the measured performance characteristics were 
consistent with and reflective of the final product. 

Operating System-based Tests 
Since the system configurations used could not 
supply data fast enough for conclusions to be made 
regarding the DLT2000 tape drive's maximum 
throughput rates, compression efficiency was the 
focus of the operating system testing. Test param- 
eters were still chosen to minimize throughput 
bottlenecks in the host system. For each test, the 
data was set up on a single disk on each of two sys- 
tems—an OpenVMS system and a UNIX system. 

OpenVMS Tests The OpenVMS system used in the 
tests was a clustered MicroVAX 3400 machine with a 
KZQSA adapter for the SCSI bus. The MicroVAX 3400 
system was running the OpenVMS Version 5.5-2 
operating system and used the standard backup 
utility (BACKUP) to write data to the DLT2000 tape 
drive. Although compression efficiency was the 
focus of the operating system testing, we selected 
the following BACKUP options to maximize system 
throughput as much as possible: 

■ /NOCRC. This option disables a cyclic redun- 
dancy check (CRC) calculated and stored in the 
tape block by BACKUP for extra data integrity 
protection. Since the CRC calculations are CPU 
intensive, they were disabled to minimize system 
bottlenecks. 

■ /BLOCK_SIZE=65024. A block size of 65,024 mini- 
mizes host and SCSI bus overhead to a reasonable 
degree. 

■ /GROUP_SIZE=0. This option disables the cre- 
ation of (and the writing to tape of) an exclusive 
OR (XOR) block calculated by BACKUP. By 
default, BACKUP would create one XOR block for 
every 10 data blocks. We disabled XOR blocks 
because their presence would probably decrease 
the compression ratio and system throughput. 

We tested the following types of data on the 
OpenVMS system. 

■ Bin — the BACKUP of a set of binary files, mainly 
executable files 

■ Sys — the image BACKUP of the system disk 

■ C— the BACKUP of the DLT2000 product's 
firmware source library, primarily C code and 
include files 

UNIX Tests The UNIX configuration used for test- 
ing was a DECsystem 5500 system running the 
ULTRIX Version 4.2c operating system. The SCSI 
common access model (CAM) software driver was 
used, running on this machine's native SCSI port. 
The standard ULTRIX tar and dd utilities were used 
to copy the following data to the tape: 

■ Text — ASCII text files of product documentation 
manuals 

■ PS— PostScript versions of the manuals 

■ tar — tar backup of the system disk 

■ HarGra — the chart and art files shipped with the 
standard Harvard Graphics software package 

■ ValLog — the files containing the gate array 
design database, which was built using Valid 
Logic tools 

Throughput Tests 

The throughput tests were performed on PC-based 
Adaptec SDS-3 SCSI development/test systems. The 
development team chose this test environment to 
do repeatable, high-performance testing because it 
is relatively unconstrained by disk, file system, CPU, 
or application software bottlenecks for the perfor- 
mance range of the DLT2000 tape drive. 



We tested the following data types on the SDS-3 
system: 

■ Binary — an OpenVMS VAX object file 

■ Source— C source code 

■ VAXcam — a VAXcamera image file in PostScript 
format 

■ HarGra — a collection of chart and art files 
shipped with the standard Harvard Graphics 
software package 

■ Paint — a complicated Paintbrush file, in bitmap 
format 

■ Ones — an all ones (hex FF) pattern 

■ Repeat — a string of 24 unique characters, 
repeated as needed 

SCSI bus protocol overhead can be somewhat 
high on an SDS-3 system, and compression ratio and 
throughput rate can vary depending on the tape 
block size. Consequently, all measurements were 
taken using 64K-byte tape blocks. This block size 
minimizes per-command overhead on the SCSI bus, 
as well as in the host. With high enough compres- 
sion ratios, however, this overhead was still a limit- 
ing factor for 64K-byte blocks on the IDRC testing, as 
will be shown later in the SDS-3 Test Results section. 

Another factor in SCSI bus performance is 
whether synchronous or asynchronous data trans- 
fer mode is used. Asynchronous transfer mode 
requires a full handshake to transfer each data byte, 
which can seriously decrease the bandwidth of the 
SCSI bus in many configurations. Synchronous 
transfer mode (period/offset = 200/7) was enabled, 
which tends to minimize the effect of cable length 
on performance. 

For a given data type, the same amount of data, 
i.e., from 50M bytes to 300M bytes, was transferred 
to both versions of the tape product. We often per- 
formed several test runs using different amounts of 
data to check the consistency of the test results. 

To maximize the applicability of the test results, 
we wanted to use "real world" data. To do so in our 
test environment was not practical or would have 
introduced delays between blocks, thus ruining 
any throughput measurements. We obtained a com- 
promise in the following manner. The SDS-3 tool 
we used is limited by a 64K-byte buffer for high- 
speed transfers. That buffer can be used repeatedly, 
and the direct memory access (DMA) pointers auto- 
matically "wrap around" back to the start when 
they reach the end of the buffer. We created a tool 
that takes the first 64K bytes from a file with the 
desired test data, reformats the data, and writes 
the data to an output file compatible with the 
SDS-3 software. This "buffer file" can then be 
uploaded into the SDS-3 tool's memory buffer, thus 
duplicating the first 64K bytes of the data from the 
test file in SDS-3 memory. The tool has an obvious 
limitation; the first 64K bytes of data might not be 
representative of the rest of the data in the file. 
Using this tool was, however, a practical way to 
transfer at least subsets of real data into the 
throughput test environment. 
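
The wrap-around behavior of the buffer can be sketched as follows. This is a hypothetical model, not the actual tool: the SDS-3 file format and its reformatting step are not described here and are not modeled, and the function names are assumptions.

```python
BUFFER_SIZE = 64 * 1024   # the 64K-byte high-speed transfer buffer

def first_64k(path):
    """Read the first 64K bytes of a test data file."""
    with open(path, "rb") as source:
        return source.read(BUFFER_SIZE)

def wrapped_blocks(buffer, total_bytes, block_size):
    """Yield blocks read from a buffer whose pointer wraps back to the start."""
    if not buffer:
        raise ValueError("buffer must not be empty")
    position, remaining = 0, total_bytes
    while remaining > 0:
        n = min(block_size, remaining)
        # Repeating the buffer lets one slice cover any wrap past the end.
        repeats = n // len(buffer) + 2
        yield (buffer * repeats)[position:position + n]
        position = (position + n) % len(buffer)
        remaining -= n

# Small demonstration with a 10-byte stand-in for the 64K-byte buffer.
demo = bytes(range(10))
for block in wrapped_blocks(demo, total_bytes=25, block_size=8):
    print(list(block))
```

Every block sent to the drive is therefore drawn from the same 64K bytes, which is the limitation noted above: data beyond the first 64K bytes of the file never reaches the drive.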

Benchmark Tests 

Since preliminary results of our study indicated 
that the IDRC chip has a lower compression ratio 
than that indicated by previous studies, the bench- 
mark tests were performed primarily to confirm 
the compression efficiency of the IDRC DLT2000 
implementation. 4 For the benchmark tests, we 
tested two tape products that use IDRC compres- 
sion implementations. 

The first product tested was Digital's TA91 tape 
drive (which is compatible with an IBM 3480E 
tape drive) configured on a Hierarchical Storage 
Controller (HSC) in a VAXcluster configuration. 
A collection of chart and art files included with 
the standard Harvard Graphics software package 
composed the data set. This identical data set was 
written to an IDRC DLT2000 tape drive for accurate 
comparison. 

The second benchmark product tested was an 
EXB-8505 tape drive, which also uses IDRC com- 
pression. 9 We tested the EXB-8505 tape drive on an 
SDS-3 test system. The data set used was the first 
64K bytes of the text of the U.S. Constitution. We 
compared the compression ratio obtained on the 
EXB-8505 with the compression ratio for the same 
data written to a DLZ1 DLT2000 unit and with text 
data compressed on an IDRC DLT2000 tape drive. 
(The text data on the IDRC implementation was dif- 
ferent from the text data on the EXB-8505 and DLZ1 
implementations because an IDRC prototype was 
no longer readily available when the U.S. Consti- 
tution data became part of the tests.) We also per- 
formed some throughput tests to compare the 
DLZ1 DLT2000 and the EXB-8505 drives. 

We measured the native product capacity of the 
TA91 and EXB-8505 tape drives by writing to the end 
of tape (EOT) with compression disabled. We then 
repeated this test with compression enabled. 



Test Results 

The compression ratios shown in the test results 
are calculated by dividing the number of bytes of 
uncompressed data by the number of bytes of the 
same data when compressed. Therefore, a compres- 
sion ratio of 2.0:1, or simply 2.0, means that the data 
compressed to one-half its original size, and if main- 
tained for that whole tape, such compression 
would effectively double the data capacity of the 
tape drive. 
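
As a short worked example of this definition, the sketch below computes a compression ratio from byte counts and the effective capacity that results. The function names are hypothetical; the 10G-byte native capacity and the 2.5:1 ratio are the figures quoted in this paper.

```python
def compression_ratio(uncompressed_bytes, compressed_bytes):
    """Bytes of uncompressed data divided by bytes of the same data when compressed."""
    return uncompressed_bytes / compressed_bytes

def effective_capacity(native_capacity, ratio):
    """Capacity seen by the host if the ratio holds for the whole tape."""
    return native_capacity * ratio

print(compression_ratio(65536, 32768))   # 2.0
print(effective_capacity(10, 2.0))       # 20.0 (G bytes)
print(effective_capacity(10, 2.5))       # 25.0 (G bytes)
```

The benchmark tests applied the same definition in reverse, dividing the total capacity measured with compression enabled by the capacity measured with compression disabled.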

Operating System Test Results 
Figure 2 shows the measurements of compression 
ratio on the OpenVMS and UNIX systems. The differ- 
ence between the compression ratios of the DLZ1 
prototype and those of the IDRC prototype is strik- 
ing on the graph. The DLZ1 prototype had signifi- 
cantly higher compression ratios for all the data 
types tested. Note that these results, as compared 
to the results of the SDS-3 testing, are more repre- 
sentative of the real world, since most of these data 
sets came from live multimegabyte databases. 

We tested the ULTRIX dump utility on the same 
system and data on which we ran the tar utility. 
The dump utility compression ratios were almost 
identical to those obtained with the tar utility. This 
result was not surprising since the bulk of the data 
stored was identical; only the metadata created 
by the utility varied. For comparison purposes, the 
average compression ratio for these data types was 
2.76 for the DLZ1 prototype and 1.54 for the IDRC 
prototype. 

Although compression measurements were the 
focus of the operating system-based tests, for gen- 
eral information, we also took some throughput 
measurements. The DECsystem 5500 system run- 
ning the dd utility achieved write rates of approxi- 
mately 0.85M bytes/s for the data types. Running 
the tapex utility's performance test (which is not 
disk or file system limited) on a similar machine 
resulted in rates of more than 3M bytes/s. The 
3M-byte/s rate implies that, when running dd or tar, 
the disk and/or file system is the likely bottleneck, 
since the ULTRIX drivers, SCSI channel, and tape 
driver were capable of three times the throughput. 
(Other possibilities are inefficiencies within dd 
and/or tar, inefficient handling of two devices on 
the SCSI bus, insufficient CPU horsepower, etc.) 

OpenVMS tests showed similar results for the 
BACKUP utility, but the throughput is likely to have 
been limited by the KZQSA adapter. Other tests 
indicate that the KZQSA has a limit of 0.8M bytes/s to 
0.9M bytes/s with the OpenVMS system. 

The informal operating system throughput test- 
ing confirms that the particular configurations 
tested are not suitable for measuring the bandwidth 
limits of the DLT2000 tape drive, when using the 
standard backup utilities. Note that the newer VAX 
and the Alpha AXP platforms have much higher 
throughput capabilities and are able to more fully 
utilize the capabilities of the DLT2000 product. 
These platforms were not available when we per- 
formed this study. 

SDS-3 Test Results 

The SDS-3 tests measured compression ratios and 
data throughput rates. 

Compression Figure 3 shows the SDS-3 data com- 
pression ratios. The ratios for the first four data 
types are in the normal range, i.e., the DLZ1 proto- 
type averaged approximately 2.4 and the IDRC 
prototype averaged approximately 1.5. For the 
Paintbrush bitmap file, both prototype versions 
compressed at about the same efficiency. 

Although the 30:1 compression ratio for the 
Ones pattern data is not representative of normal 
data, the ratio gives a sense of the maximum effi- 
ciency of the algorithms. The Repeat pattern test 
ratios highlight the ability of the DLZ1 algorithm to 
capitalize on redundant strings of moderate length 
(24 bytes, in this case). The IDRC algorithm lacks 
this ability. None of the many data sets tested 
compressed better with the IDRC algorithm than 
with the DLZ1 algorithm. (We tested six other 
data sets but did not include the test results in this 
paper because they showed little variation from 
those presented.) 

Throughput Rates Figure 4 shows the data 
throughput rates for six of the data types; com- 
pression ratios are annotated at the bottom for 
convenience. The use of a line graph rather than 
a bar graph suggests some correlation between 
compression ratio and throughput. We tested vari- 
ants of these data types to explore the strength of 
this correlation. 

With the DLZ1 algorithm, we found data sets that
had the same compression ratio but significantly
different throughput rates. We saw variations of up
to ±0.3M bytes/s from the "expected" rate, which
is the native drive rate (1.25M bytes/s) multiplied by
the compression ratio. 
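
As a rough illustration of the "expected" rate described above, the following Python sketch multiplies the native drive rate by the compression ratio and reports how far a measured rate deviates from it. The native rate of 1.25M bytes/s is the figure used in the calculation in the Conclusions; the function names and the sample measurement are hypothetical and are not taken from the test data.

```python
# Sketch only: names and the sample measurement below are illustrative assumptions.
NATIVE_RATE = 1.25  # native drive rate in M bytes/s, as used in the Conclusions


def expected_rate(compression_ratio, native_rate=NATIVE_RATE):
    """Expected host throughput: native drive rate times compression ratio."""
    return native_rate * compression_ratio


def deviation(measured_rate, compression_ratio, native_rate=NATIVE_RATE):
    """Signed difference between a measured rate and the expected rate."""
    return measured_rate - expected_rate(compression_ratio, native_rate)


# Hypothetical measurement: 2.0M bytes/s observed at a compression ratio of 1.4.
sample_deviation = deviation(2.0, 1.4)
print(f"expected {expected_rate(1.4):.2f}M bytes/s, "
      f"deviation {sample_deviation:+.2f}M bytes/s")
```

A deviation within about 0.3M bytes/s of zero would fall inside the spread reported for the DLZ1 algorithm.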

The throughput rate with the IDRC algorithm 
tends to correlate more strongly with the compres- 
sion ratio, but we did see variations. For example, 
the VAXcamera data at a compression ratio of 1.4 
transfers about 0.1M bytes/s faster than Harvard
Graphics data, which compresses at 1.6.

Even more striking is the difference between write and
read transfer rates. The DLZ1 algorithm is almost 
always significantly faster on decompression. This 
feature is characteristic of this type of LZ algorithm. 
On the other hand, IDRC write and read rates match 
very closely, typically within 0.05M bytes/s. 



The throughput limit of the SDS-3 system was
usually high enough not to be a factor. Knowing
this fact was essential for the proper interpretation
of test results. A bottleneck in the tape
device must be distinguishable from an adapter 
or tester limitation. We measured the throughput 
limit of the SDS-3 system by writing and reading the 
Ones pattern and similar data patterns, which are 
highly compressible by the IDRC algorithm. With 
a 64K-byte block size, throughput on the SDS-3 
system peaked at about 3.5M bytes/s. When we 
increased the block size to 1M byte, the throughput
jumped to nearly 4.5M bytes/s. This increase was 
due to reduction in the amount of command over- 
head for a given amount of data being transferred 
on the SCSI bus. None of the normal data types 
tested, except the Paintbrush bitmap files, could 
approach compression ratios high enough to begin 
to push the limits of the SDS-3 system. 
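
The effect of block size on command overhead can be illustrated with a simple model that is not part of the original analysis: assume every command costs a fixed overhead plus the time needed to move its data at a fixed bus rate. The Python sketch below fits those two unknowns to the two measured points quoted above. It assumes that a 64K-byte block is 65,536 bytes, that a 1M-byte block is 1,048,576 bytes, and that "M bytes/s" means 10**6 bytes per second; the function names are hypothetical, and the fitted values are only what this toy model implies, not measurements.

```python
# Sketch only: a hypothetical fixed-overhead model, not the authors' analysis.
K = 1024        # assumption: "64K-byte" means 64 * 1024 bytes
M = 1_000_000   # assumption: "M bytes/s" means 10**6 bytes per second


def fit_overhead_model(block1, rate1, block2, rate2):
    """Solve time = block / bus_rate + overhead from two measured points."""
    time1 = block1 / rate1
    time2 = block2 / rate2
    bus_rate = (block2 - block1) / (time2 - time1)
    overhead = time1 - block1 / bus_rate
    return bus_rate, overhead


def throughput(block_size, bus_rate, overhead):
    """Throughput predicted by the model for a given block size."""
    return block_size / (block_size / bus_rate + overhead)


# Measured points from the text: 64K-byte blocks at about 3.5M bytes/s,
# 1M-byte blocks (assumed 1024K bytes) at nearly 4.5M bytes/s.
bus_rate, overhead = fit_overhead_model(64 * K, 3.5 * M, 1024 * K, 4.5 * M)
print(f"implied bus rate: {bus_rate / M:.2f}M bytes/s")
print(f"implied per-command overhead: {overhead * 1000:.2f} ms")
```

Under these assumptions the fit implies a per-command overhead of a few milliseconds, which is amortized over more data as the block size grows. The exact figures depend on the rounding of the quoted rates.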

These results indicate that at higher data rates, 
the SDS-3 system becomes a limiting factor. Analysis 
of SCSI protocol handling on the SDS-3 system 
shows that the nondata portions of a transaction 
(e.g., message, command, and status) are handled 
somewhat inefficiently. At high throughput rates, 
this overhead is significant enough to affect 
throughput to the device. Using a larger block size 
reduces this per-command overhead for a given 
amount of tape data and allows a higher throughput
to be achieved on the SCSI bus.

Benchmark Test Results

We wrote the Harvard Graphics data set repeatedly 
to the TA91 tape drive. With compression disabled, 
about 132M bytes fit on the media. With compression
enabled, 216M bytes were written, giving
a compression ratio of 1.64. This ratio compares 
closely with the 1.66 obtained on the IDRC DLT2000 
prototype. 

We then used the SDS-3 tool to repeatedly write 
the first 64K bytes of the U.S. Constitution to the 
EXB-8505 tape drive. With compression disabled, 
about 5G bytes were written. With compression 
enabled, 7.6G bytes were written, giving a compres- 
sion ratio of 1.52. Again, this corresponds closely 
with the compression ratio of 1.54 achieved when 
writing text data on the IDRC DLT2000 prototype. 
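
The benchmark ratios above follow directly from the amounts of data written with and without compression. A minimal Python sketch of that arithmetic, using the figures quoted in the text (the function name is an illustrative assumption):

```python
# Sketch only: the function name is an illustrative assumption.
def compression_ratio(written_with_compression, written_without_compression):
    """Ratio of data that fits with compression to data that fits without it."""
    if written_without_compression <= 0:
        raise ValueError("uncompressed amount must be positive")
    return written_with_compression / written_without_compression


# TA91 tape drive, Harvard Graphics data set (amounts in M bytes).
ta91_ratio = compression_ratio(216, 132)

# EXB-8505 tape drive, U.S. Constitution text (amounts in G bytes).
exb8505_ratio = compression_ratio(7.6, 5.0)

print(f"TA91: {ta91_ratio:.2f}, EXB-8505: {exb8505_ratio:.2f}")
```

Both amounts in a pair must use the same unit, since only their quotient matters.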

We performed more testing for general compari- 
son between the DLZ1 DLT2000 product and the 
EXB-8505 product. The U.S. Constitution data com- 
pressed at 2.23 on the DLT2000 drive and at 1.52 on 
the EXB-8505 drive. Figure 5 shows the results of 
throughput testing with this data on these two 
products, using two block sizes, 10K-byte blocks
and 64K-byte blocks. 

Conclusions 

The compression efficiency testing outlined in this 
paper indicates that, for most data sets, the DLZ1 
algorithm achieves a higher compression
ratio than the IDRC algorithm and, therefore, yields
a consistent capacity advantage over the IDRC algo- 
rithm. The reader should carefully note that regard- 
less of the algorithm used, the actual capacity 
increase that a user might realize with data com- 
pression depends heavily on the specific mix of 
data. The following summarizes the compression 
results presented in this paper. Based on the com- 
pression testing in the operating system environ- 
ment, a DLT2000 product using DLZ1 compression 
has a typical capacity of 25G bytes to 30G bytes. 
A DLT2000 product using IDRC compression would 
typically hold about 15G bytes of data. 

The data throughput testing showed that, in most 
cases, the DLZ1 DLT2000 prototype transferred data 
at a faster rate than the IDRC DLT2000 prototype,
even though the IDRC prototype's hardware implementation
was capable of almost twice the data rate
(5M bytes/s for the IDRC drive and 2.5M/3.0M bytes/s
for the DLZ1 drive). The IDRC implementation did 
not perform better for two reasons. 

1. Given the same data set, the compression ratio 
of the IDRC implementation is almost always less 
than that of the DLZ1 implementation. 

2. The typical compression ratio of the IDRC imple- 
mentation is somewhat low, in an absolute sense 
(less than 1.8). 

Since data compression in the tape device has a 
multiplying effect on data transfer rates seen by the 
host, a low compression ratio limits the practical 
rate at which compressed data can be made avail- 
able to the tape media. 

To transfer data faster than the DLZ1 prototype, 
the IDRC prototype must achieve a compression 
ratio that multiplies the drive's native data rate
beyond the throughput limit of the DLZ1 prototype.
This limit is about 2.5M bytes/s for write operations.
Calculating the approximate minimum
compression ratio (Cr) needed is straightforward, 
as the following steps show: 

Cr × (native data transfer rate) = throughput limit
Cr × 1.25M bytes/s = 2.5M bytes/s
Cr = (2.5M bytes/s)/(1.25M bytes/s)
Cr = 2.0 (or 2.0:1)

Thus, when the IDRC prototype compresses
data at a ratio greater than 2.0:1, its transfer rate
should exceed that of the DLZ1 prototype. Indeed,
with the Paintbrush and Ones data patterns, the
compression ratio was more than 4.0:1, and the
transfer rate measurements show the throughput 
potential of the IDRC implementation over the DLZ1 
implementation. These data patterns are not typi- 
cal, however, and more realistic data sets (e.g., 
binary, source files, text, and databases) show the 
IDRC algorithm compression ratios to be only in the 
1.5 to 1.7 range. The benchmark testing confirms 
these results and, therefore, the correctness of the 
IDRC DLT2000 implementation. These low IDRC 
compression ratios for typical data are what pre- 
vent the IDRC implementation from achieving its 
throughput potential on the DLT2000 tape product. 
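
The break-even calculation above can be sketched in a few lines of Python. The two constants are the figures given in the text; the function names are illustrative assumptions, and the check deliberately ignores the IDRC drive's own hardware limit.

```python
# Sketch only: function names are illustrative assumptions.
NATIVE_RATE = 1.25       # native data transfer rate, M bytes/s
DLZ1_WRITE_LIMIT = 2.5   # DLZ1 prototype write throughput limit, M bytes/s


def break_even_ratio(throughput_limit=DLZ1_WRITE_LIMIT, native_rate=NATIVE_RATE):
    """Minimum compression ratio (Cr) at which native_rate * Cr reaches the limit."""
    if native_rate <= 0:
        raise ValueError("native_rate must be positive")
    return throughput_limit / native_rate


def idrc_can_exceed_dlz1(idrc_ratio):
    """True when an IDRC compression ratio is above the break-even ratio."""
    return idrc_ratio > break_even_ratio()


# Typical IDRC ratios (1.5 to 1.7) versus a highly compressible pattern (4.0).
for ratio in (1.5, 1.7, 4.0):
    print(ratio, idrc_can_exceed_dlz1(ratio))
```

Only the highly compressible case clears the 2.0:1 threshold, which matches the pattern reported for the Paintbrush and Ones data.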

The DLZ1 DLT2000 implementation was adopted 
for the actual DLT2000 tape product. As the devel- 
opment team completed the design, they made 
hardware and firmware improvements to enhance 
the data throughput characteristics of the final 
product. For example, they increased the clock rate 
on the compression chip by 10 percent and optimized
critical firmware code paths.

Digital Press

Digital Press, the authorized publisher for 
Digital Equipment Corporation, is an imprint 
of Butterworth-Heinemann, a major international 
publisher of professional books and a member 
of the Reed Elsevier group. The following are 
descriptions of computing titles available from 
Digital Press. 

OPENVMS AXP INTERNALS AND 
DATA STRUCTURES 

Ruth E. Goldenberg and Saro Saravanan, 
June 1994, hardbound, 1,800 pages, 
ISBN: 55558-120-X ($150.00).

This book describes in vivid detail the internals 
and data structures of the OpenVMS AXP operat- 
ing system Version 1.5. Perhaps the most com- 
prehensive and up-to-date description available 
for a commercial operating system, it is an 
irreplaceable reference for operating system 
development engineers, operating system 
troubleshooting experts, systems programmers, 
consultants, and customer support specialists. 
Some of the text and much of the book's struc- 
ture are derived from its highly successful 
predecessor, VAX/VMS Internals and Data 
Structures: Version 5.2. The new work is divided
into nine parts: Introduction; Control Mechanisms;
Synchronization; Scheduling and Time Support;
Memory Management; Input/Output; Life of
a Process; Life of the System; and Miscellaneous
Topics. Each of the 39 chapters is akin to a case 
study on the topic it covers, based on the depth 
and breadth of treatment. 

THE UNIX PHILOSOPHY 

Mike Gancarz, September 1994, softcover, 
176 est. pages, ISBN: 55558-123-4 ($19.95 est.).

Unlike many books that focus on how to use the 
UNIX operating system, The UNIX Philosophy concentrates
on answering the question, "Why use
UNIX in the first place?" Readers will discover the
rationale and reasons for such concepts as file system
organization, user interface, and other system
characteristics. In an informative, nontechnical
fashion, The UNIX Philosophy explores the general
principles for applying the UNIX philosophy to software
development. This book describes complex
software design principles and addresses the impor- 
tance of small programs, code and data portability, 
early prototyping, and open user interfaces. Written
for both the computer layperson and the experienced
programmer, this book explores the tenets
of the UNIX operating system in detail, dealing with 
powerful concepts in a comprehensive, straight- 
forward manner. 

VAXCLUSTER PRINCIPLES 

Roy G. Davis, 1993, paperbound, 600 pages, 
ISBN: 55558-112-9 ($49.95). 

This in-depth exploration of the VMS operating 
system is ideal for computer professionals who 
need a thorough understanding of VAXcluster components
and functionality to support, manage, and
develop applications in a VAXcluster environment.

DIGITAL'S CDD/REPOSITORY: 
A Comprehensive Guide 

Lilian Hobbs and Ken England, 1993, paperbound, 
259 pages, ISBN: 55558-113-7 ($34.95).

This comprehensive guide focuses on Version 5.0 
of CDD/Repository — an extremely sophisticated 
and powerful repository based on an object- 
oriented approach. This active distributed reposi- 
tory system provides the functionality necessary 
for users to organize, manage, control, and integrate 
tools and applications across their companies. The 
repository simplifies application development by 
providing information management and environment
management features.

WORKING WITH TEAMLINKS 

Tony Redmond, 1993, paperbound, 446 pages 
(includes diskette), ISBN: 55558-116-1 ($44.95). 

Working with TeamLinks is a practical guide to
Digital's office system for the Microsoft Windows 
graphical user environment. Its thorough coverage 
will help experienced and inexperienced users, 
programmers, and system implementers realize 
the benefits while avoiding the pitfalls of using 
PCs in an integrated multivendor office system. 
The book shows how the TeamLinks File Cabinet 
works, how TeamLinks mail flows, how to stream- 
line business processes with TeamRoute document- 
routing system, and how to integrate applications 
in a TeamLinks environment. It discusses the prob- 
lems of implementing a PC-based office system and 
of managing the process of migration from ALL-IN-1
ISO, Digital's minicomputer-based office system. 
An appendix documents TeamLinks internal codes 
and presents other interesting information. A com- 
panion diskette contains many sample programs 
that can be used as a base for your own solutions.

NAS ARCHITECTURE REFERENCE MANUAL

Leo F. Laverdure, Patricia Srite, and John Colonna- 
Romano, 1993, paperbound, 564 pages, ISBN: 
55558-115-3 ($34.95). 

Designed for anyone interested in learning about 
the NAS architecture— including application 
developers, technical consultants, Independent 
Software Vendors (ISVs), Value-Added Resellers 
(VARs), and Digital's Integrated Business Units 
(IBUs) — the NAS Architecture Reference Manual 
provides information on the NAS services and the 
key public interfaces supported by each service. 

NAS: Digital's Approach to Open Systems 

James Martin and Joe Leben, 1993, paperbound, 
412 pages, ISBN: 55558-117-X ($34.95). 

Network Application Support (NAS) is both a 
comprehensive architecture and a set of software 
products. NAS provides a framework that makes 
it possible for applications developers to enhance 
those characteristics of computing applications 
that promote interoperability, application distribut- 
ability, and application portability among applica- 
tions that run on Digital's computing platforms 
as well as applications from other vendors, such 
as IBM, Hewlett-Packard, Sun Microsystems, and 
Apple Computer. For managers, executives, 
and information systems staff, the book describes 
the two types of NAS products: (1) the develop- 
ment toolkits that provide services directly to 
computing applications, both Digital applications 
and user-written applications— this important 
new class of software, called middleware, operates 
as an intermediary between application programs 
and the underlying hardware/software platform; 
and (2) the products that build on this NAS middle- 
ware to provide services directly to the end users 
of computing services. 

USING MS-DOS KERMIT: Connecting Your PC 
to the Electronic World, Second Edition 

Christine M. Gianone, 1992, paperbound, 345 pages 
(includes diskette), ISBN: 55558-082-3 ($34.95). 

Using MS-DOS Kermit is a book/disk package
designed to help technical and nontechnical PC 
users alike to link their IBM PCs, PS/2s, or compati- 
bles to other computers and data services — e.g., 
Dow Jones News/Retrieval, MCI Mail, databases like 
BBS, DIALOG, and TYMNET, and any mainframe— 
throughout the world. Based on the author's close
involvement with development and distribution
of the Kermit transfer protocol, the guide supplies 
easy-to-follow, step-by-step instructions, meticu- 
lously compiled tables, and at-a-glance information 
on important areas. 

USING C-KERMIT: Communication 
Software for OS/2, Atari ST, UNIX, OS-9,
VMS, AOS/VS, Amiga 

Frank da Cruz and Christine M. Gianone, 1993, 
paperbound, 514 pages, ISBN: 55558-108-0 ($34.95). 

Using C-Kermit describes the new release, 5A, of 
Columbia University's popular C-Kermit communi- 
cation software — the most portable of all commu- 
nication software packages. Available at low cost 
on a variety of magnetic media from Columbia 
University, C-Kermit can be used on computers of 
all sizes, ranging from desktop workstations to mini- 
computers to mainframes and supercomputers. 
The numerous examples, illustrations, and tables 
in Using C-Kermit make the powerful and versatile 
C-Kermit functions accessible for new and experi- 
enced users. 

USING DECWINDOWS MOTIF FOR OPENVMS 

Margie Sherlock, 1993, paperbound, 363 pages, 
ISBN: 55558-114-5 ($34.95). 

The book Using DECwindows Motif for OpenVMS 
is designed to help new OpenVMS DECwindows 
users explore and apply DECwindows tech- 
niques and features and to provide experienced 
DECwindows users with practical information 
about the Motif interface, ways to customize envi- 
ronments, and advanced user topics. OpenVMS 
DECwindows Motif is based on MIT's specification 
for the X Window System, Version 11, Release 4 and
OSF/Motif 1.1.1. 

X AND MOTIF QUICK REFERENCE GUIDE, 
Second Edition 

Randi J. Rost, 1993, paperbound, 398 pages,
ISBN: 55558-118-8 ($24.95). 

Arranged in five sections — X Protocol Reference, 
Xlib Reference, X Toolkit Reference, Motif Refer- 
ence, and General X Reference— and organized 
alphabetically with thumb tabs for quick and easy 
reference, the X and Motif Quick Reference Guide 
provides complete descriptions of routines and 
user-accessible data structures, including Xlib sub- 
routines and macros, X Toolkit Intrinsics routines,
Motif routines, and all of the Motif Widgets. The
Second Edition has been updated to reflect new 
functionality in both X Window System, Version 
11, Release 5, and OSF/Motif Version 1.2, including 
routines in Xlib to support the color management
system and new routines in Xlib to better provide
support for internationalization and localization. 

ALL-IN-1: Managing and Programming in V3.0 

Tony Redmond, 1993, paperbound, 552 pages, 
ISBN: 55558-101-3 ($52.95). 

ALL-IN-1: Managing and Programming in V3.0
assists both new and experienced ALL-IN-1 system 
managers and programmers to make the best of 
ALL-IN-1 V3.0— the best single release of ALL-IN-1 
since V2.0 (1985).

ALL-IN-1: Integrating Applications in V3.0 

John Rhoton, 1993, paperbound, 265 pages, 
ISBN: 55558-102-1 ($52.95). 

ALL-IN-1: Integrating Applications in V3.0
helps programmers experienced in third- 
generation languages to use code-level inte- 
gration to (1) integrate non-Digital products 
and applications that may be either difficult 
to integrate using documented ALL-IN-1 features, 
integrated only by incurring significant perfor- 
mance overhead, or integratable without pre- 
servation of the ALL-IN-1 familiar look and feel; 
(2) build applications that surpass performance 
limitations; and (3) access external data stored 
in any format. The book gives system managers 
an overview of code-level integration and diag- 
nostic help for product installations, including 
coverage on relinking the ALL-IN-1 image. 

A BEGINNER'S GUIDE TO VAX/VMS UTILITIES
AND APPLICATIONS, Second Edition 

Ronald M. Sawey and Troy T. Stokes, 1992, paperbound,
399 pages, ISBN: 55558-066-1 ($27.95).

A Beginner's Guide to VAX/VMS Utilities and Applications
offers a hands-on introduction to the EDT
and EVE screen editor programs, the DECspell spell- 
ing checker, WPS-PLUS, phone and mail utilities, 
VAX notes, the DATATRIEVE database management 
program, the DECalc electronic spreadsheet, the 
BITNET network, and more. Included are a wealth 
of lively examples, exercises, and illustrations, plus 
"quick reference" charts summarizing commands 
and operations at the end of each chapter. 



VAX/VMS OPERATING SYSTEM CONCEPTS 

David Donald Miller, 1992, hardbound, 550 pages,
ISBN: 55558-065-3 ($44.95).

Combining discussions of operating system 
theory with examples of its application in key 
VAX/VMS operating system facilities, this book 
provides a thoughtful introduction for appli- 
cation programmers, system managers, and 
students. Each chapter begins with a discussion 
of the theoretical aspects of a key operating sys- 
tem concept — including generally recognized 
solutions and algorithms — followed by an expla- 
nation of how the concept is implemented, plus 
an example that shows the uses and implications 
of the approach. 

DECNET PHASE V: An OSI Implementation 

James Martin and Joe Leben, 1992, clothbound, 
572 pages, ISBN: 55558-076-9 ($49.95). 

Broaden your understanding of how large networks 
are designed with this introduction to the concepts 
surrounding Digital Network Architecture. This 
book blends an OSI tutorial with a complete look at 
how OSI technology is used in a Digital computer 
network. You will gain useful insights into OSI and 
the process of OSI standardization as well as imple- 
mentation—all presented in a straightforward, 
easy-to-follow style. 

INFORMATION IN THE ENTERPRISE: 
It's More than Technology 

Geoffrey Darnton and Sergio Giacoletto, 1992, 
clothbound, 318 pages, ISBN: 55558-091-2 ($34.95). 

This nontechnical book examines the role of infor- 
mation in the broader business enterprise— how to 
use it to gain competitive advantage and to redesign 
business processes for greater efficiency. 

ENTERPRISE NETWORKING: 
Working Together Apart 

Ray Grenier and George Metes, 1992, clothbound, 
260 pages, ISBN: 55558-074-2 ($29.95). 

Focusing on work environments in which 
knowledge workers use electronic networks 
and networking techniques to access, com- 
municate, and share information, this book 
develops strategic and practical approaches 
that distributed organizations can use to succeed 
and compete.

Call for Authors

Digital Press has become an imprint of 
Butterworth-Heinemann, a major interna- 
tional publisher of professional books and 
a member of the Reed Elsevier group. Digital 
Press remains the authorized publisher for 
Digital Equipment Corporation; the two
companies are working in partnership to 
identify and publish new books under the 
Digital Press imprint and create opportuni- 
ties for authors to publish their work. 

Digital Press remains committed to publish- 
ing high-quality books on a wide variety 
of subjects. We would like to hear from you 
if you are writing or thinking about writing 
a book. 

Contact: Frank Satlow, Publisher 
Digital Press 
313 Washington Street 
Newton, MA 02158 
Tel: (617) 928-2649 



